Abstract: In this work, we consider multitask learning problems
where clusters of nodes are interested in estimating their
own parameter vector. Cooperation among clusters is beneficial
when the optimal models of adjacent clusters have a good number
of similar entries. We propose a fully distributed algorithm for
solving this problem. The approach relies on minimizing a global
mean-square error criterion regularized by non-differentiable
terms to promote cooperation among neighboring clusters. A
general diffusion forward-backward splitting strategy is introduced.
Then, it is specialized to the case of sparsity promoting
regularizers. A closed-form expression for the proximal operator
of a weighted sum of $\ell_1$-norms is derived to achieve higher
efficiency. We also provide conditions on the step-sizes that ensure
convergence of the algorithm in the mean and mean-square error
sense. Simulations are conducted to illustrate the effectiveness of
the strategy.


I. Introduction 

We consider the problem of distributed adaptive learning 
over networks to simultaneously estimate several parameter 
vectors from noisy measurements using in-network processing. 
Depending on the number of parameter vectors to estimate, 
we distinguish between single-task networks and multitask 
networks. In a single-task scenario, the entire network aims to 
estimate a common parameter vector for all nodes. The nodes 
are allowed to exchange information with their neighbors to 
improve their own estimates. Then, the estimates are combined 
in order to achieve the solution of the problem. Different 
cooperation rules have been proposed and studied in the 
literature |lT|-p7|. Diffusion strategies pj-pT] are particularly
attractive since they are scalable, robust, and enable
continuous learning and adaptation in response to concept
drifts. They have also been shown to outperform consensus
implementations over adaptive networks when constant step-sizes
are employed to enable continuous adaptation.

In this work, we are interested in distributed estimation over 
multitask networks; nodes are grouped into clusters, and each 
cluster is interested in estimating its own parameter vector (i.e., 
each cluster has its own task). Although clusters may generally 
have distinct though related tasks to perform, the nodes 
may still be able to capitalize on inductive transfer between
clusters to improve their estimation accuracy. Such situations 
occur when the tasks of nearby clusters are correlated, which 
happens, for instance, in monitoring applications where agents 
in a network need to track multiple targets moving along 
correlated trajectories. Multitask diffusion estimation problems 
of this type have been addressed before in two main ways. 

In a first scenario, no prior information on possible relationships
between tasks is assumed and nodes do not know
which other nodes share the same task. In this case, all nodes 
cooperate with each other as dictated by the network topology. 
It was shown in | fTT[ that the diffusion iterates will end up 
converging to a Pareto optimal solution corresponding to a 
multi-objective optimization problem. If, on the other hand, 
the only available information is that clusters may exist in the 
network (but their structures are not known), then extended 
diffusion strategies can be developed i|Tg-@ for setting 
the combination weights in an online manner in order to 
enable automatic network clustering and, subsequently, to limit 
cooperation between clustered agents. In a second scenario, it 
is assumed that nodes know which clusters they belong to. 
In this case, multitask diffusion strategies can be derived by 
exploiting this information on the relationships between tasks. 
A couple of useful works have addressed variations of this
scenario. For example, in [23], a diffusion LMS strategy estimates
spatially-varying parameters by exploiting the spatio-temporal
correlations of the measurements at neighboring
nodes. In [24], it is assumed that there are three types of
parameters: parameters of global interest to all nodes in the
network, parameters of common interest to a subset of nodes,
and a collection of parameters of local interest. A diffusion
strategy was developed to perform estimation under these
conditions. A similar work dealing with incremental strategies
instead of diffusion strategies appears in [25]. Likewise, in
the works [26], distributed algorithms are developed to
estimate node-specific parameter vectors that lie in a common
latent signal subspace. In another work [28], the parameter
space is decomposed into two orthogonal subspaces, with
one of the subspaces being common to all nodes. There is
yet another useful way to exploit and model relationships
among tasks, namely, to formulate optimization problems
with appropriate co-regularizers between nodes. The strategy
developed in [29] adds squared $\ell_2$-norm co-regularizers to the
mean-square-error criterion in order to promote smoothness
of the graph signal. Its convergence behavior is studied over
asynchronous networks in [30].

In some applications, however, such as cognitive radio
and remote sensing [29], it may happen that the optimum
parameter vectors of neighboring clusters have a large number 
of similar entries and a relatively small number of distinct 
components. In this work, we build on the second scenario 
where the composition of the clusters is assumed to be known 
and where nodes know which cluster they belong to. It is 
then advantageous to develop distributed strategies that involve 
cooperation among adjacent clusters in order to promote and 
exploit such similarity. Although the current problem seems
to be related to the problem studied in [29], it should be noted
that the differentiable regularizers used in [29] are not effective
when sparsity promoting regularization is required. Moreover,
when neighboring nodes belonging to different clusters are
aware of the indices of common and distinct entries, and
when these indices are fixed over time, one may appeal to the
multitask diffusion strategies developed in [24], [28]. However,
in the current work, we are interested in solutions that are
able to handle situations where the only available information
is that the optimum parameter vectors of neighboring clusters
have a large number of similar entries. A multitask diffusion
algorithm with $\ell_1$-norm co-regularizers is proposed in [31] to
address this problem, leading to a subgradient descent method
distributed among the agents. The aim of this work is to
introduce a more general approach for solving such convex but
non-differentiable problems by employing instead a diffusion
forward-backward splitting strategy based on the proximal
projection operator. Before proceeding, we recall the forward-backward
splitting approach in a single-agent deterministic
environment.

Consider the problem

$$\min_{x \in \mathbb{R}^N} \; f(x) + g(x) \qquad (1)$$

with $f$ a real-valued differentiable convex function whose
gradient is $\beta$-Lipschitz continuous, and $g$ a real-valued convex
function. The proximal gradient method or the forward-backward
splitting approach for solving (1) is given by the
iteration:

$$x(i+1) = \mathrm{prox}_{\mu g}\big(x(i) - \mu \nabla f(x(i))\big), \qquad (2)$$

where $\mu$ is a constant step-size chosen such that $\mu \in (0, 2\beta^{-1}]$
to ensure convergence to the minimizer of (1). The gradient-descent
step is the forward step (explicit step) and the proximal
step is the backward step (implicit step). The proximal operator
of $\mu g(x)$ at a given point $v \in \mathbb{R}^N$ is a real-valued map given
by:

$$\mathrm{prox}_{\mu g}(v) = \arg\min_{x \in \mathbb{R}^N} \; g(x) + \frac{1}{2\mu}\|x - v\|^2. \qquad (3)$$
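As an illustration of iteration (2), the following minimal sketch applies forward-backward splitting to a toy problem that is not taken from the paper: $f(x) = \frac{1}{2}\|x - a\|^2$ (so that $\beta = 1$) and $g(x) = \lambda\|x\|_1$, whose proximal operator (3) is the well-known soft-thresholding map. The function names `soft_threshold` and `forward_backward`, the vector `a`, and the parameter values are assumptions chosen for illustration only.

```python
def soft_threshold(v, t):
    """Proximal operator of t*|.| evaluated at the scalar v."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def forward_backward(a, lam, mu, iters=200):
    """Iteration (2) for f(x) = 0.5*||x - a||^2 and g(x) = lam*||x||_1.

    The gradient of f is 1-Lipschitz, so mu should lie in (0, 2].
    """
    x = [0.0] * len(a)
    for _ in range(iters):
        # Forward (explicit) step: gradient descent on the smooth term f.
        grad = [xi - ai for xi, ai in zip(x, a)]
        y = [xi - mu * gi for xi, gi in zip(x, grad)]
        # Backward (implicit) step: proximal operator of mu*g, entry-wise.
        x = [soft_threshold(yi, mu * lam) for yi in y]
    return x


x_hat = forward_backward([3.0, -0.5, 1.2], lam=1.0, mu=0.5)
```

For this toy cost the minimizer is the soft-thresholded version of `a`, which gives a simple way to check that the iterates settle at the expected point.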

Since the proximal operator needs to be calculated at each
iteration in (2), it is important to have a closed-form expression
for evaluating it. In this work, we derive a multitask
diffusion adaptation strategy where each node employs this
approach for minimizing a cost function with sparsity based
co-regularizers. Instead of using iterative algorithms for evaluating
the proximal operator of a weighted sum of $\ell_1$-norms
at each iteration, we shall instead derive a closed-form
expression that allows us to compute it exactly. We shall also


examine under which conditions on the step-sizes the proposed 
multitask diffusion strategy is mean and mean-square stable. 
Simulations are conducted to show the effectiveness of the 
proposed strategy. An adaptive rule to guarantee an appropriate 
cooperation between clusters is also introduced. 

Notation. In what follows, normal font letters denote
scalars, boldface lowercase letters denote column vectors, and
boldface uppercase letters denote matrices. We use the symbol
$(\cdot)^\top$ to denote matrix transpose, the symbol $(\cdot)^{-1}$ to denote
matrix inverse, and the symbol $\mathrm{Tr}(\cdot)$ to denote the trace
operator. The operator $\mathrm{col}\{\cdot\}$ stacks its column vector entries
on top of each other. The symbol $\otimes$ denotes the Kronecker
product operation. The identity matrix of size $N \times N$ is denoted
by $I_N$. The $N \times M$ matrices of zeros and ones are denoted
by $0_{N \times M}$ and $1_{N \times M}$, respectively. The set $\mathcal{N}_k$ denotes the
neighbors of node $k$ including $k$. The set $\mathcal{N}_k^-$ denotes the
neighbors of node $k$ excluding $k$. Finally, $\mathcal{C}_i$ denotes the set
of nodes in the $i$-th cluster and $\mathcal{C}(k)$ denotes the cluster to
which node $k$ belongs.


II. Multitask diffusion LMS with 
Forward-Backward splitting 

A. Network model and problem formulation 

We consider a network of $N$ nodes grouped into $Q$ connected
clusters in a predefined topology. Clusters are assumed
to be connected, i.e., there exists a path between any pair
of nodes in the cluster. At every time instant $i$, every node
$k$ has access to a zero-mean measurement $d_k(i)$ and a zero-mean
$M \times 1$ regression vector $x_k(i)$ with positive definite covariance
matrix $R_{x,k} = E\{x_k(i)\, x_k^\top(i)\} > 0$. We assume the data to
be related via the linear model:

$$d_k(i) = x_k^\top(i)\, w_k^o + z_k(i), \qquad (4)$$

where $w_k^o$ is the $M \times 1$ unknown parameter vector, also called
task, we wish to estimate at node $k$, and $z_k(i)$ is a zero-mean
measurement noise of variance $\sigma_{z,k}^2$, independent of $x_\ell(j)$ for
all $\ell$ and $j$, and independent of $z_\ell(j)$ for $\ell \neq k$ or $i \neq j$. We
assume that all nodes in a cluster are interested in estimating
the same parameter vector, namely, $w_k^o = w_{\mathcal{C}_q}^o$ whenever $k$
belongs to cluster $\mathcal{C}_q$. However, if cluster $\mathcal{C}_p$ is connected to
cluster $\mathcal{C}_q$, that is, there exists at least one link connecting a
node from $\mathcal{C}_p$ to a node from $\mathcal{C}_q$, vectors $w_{\mathcal{C}_p}^o$ and $w_{\mathcal{C}_q}^o$ are
assumed to have a large number of similar entries and only a
relatively small number of distinct entries. Cooperation across
these clusters can therefore be beneficial to infer $w_{\mathcal{C}_p}^o$ and $w_{\mathcal{C}_q}^o$.
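To make the data model (4) concrete, the sketch below draws synthetic measurements for two neighboring clusters whose tasks differ in only a few entries. It is only an illustration under stated assumptions: the helper names `make_tasks` and `measure`, the Gaussian regressors and noise, and the dimensions are all hypothetical choices that do not come from the paper.

```python
import random


def make_tasks(m, n_diff, rng):
    """Return two M x 1 tasks that agree except on n_diff randomly chosen entries."""
    w_p = [rng.gauss(0.0, 1.0) for _ in range(m)]
    w_q = list(w_p)
    for idx in rng.sample(range(m), n_diff):
        w_q[idx] += rng.gauss(0.0, 1.0)  # perturb only a few entries
    return w_p, w_q


def measure(w_o, sigma_z, rng):
    """Draw one pair (d, x) from the linear model (4): d = x^T w_o + z."""
    x = [rng.gauss(0.0, 1.0) for _ in w_o]   # zero-mean regressor (assumed Gaussian)
    z = rng.gauss(0.0, sigma_z)              # zero-mean measurement noise
    d = sum(xi * wi for xi, wi in zip(x, w_o)) + z
    return d, x


rng = random.Random(0)
w_p, w_q = make_tasks(m=10, n_diff=2, rng=rng)
d, x = measure(w_p, sigma_z=0.1, rng=rng)
```

The difference `w_p - w_q` is sparse by construction, which is the situation the co-regularizers introduced next are meant to exploit.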

Considerable interest has been shown in the literature
in estimating an optimum parameter vector $w^o$ subject
to the property of being sparse. Motivated by the well-known
LASSO problem [35] and the compressed sensing framework
[36], different techniques for sparse adaptation have
been proposed. For example, the authors in [37], [38] promote
sparsity within an LMS framework by considering regularizers
based on the $\ell_1$-norm, the reweighted $\ell_1$-norm, and a convex
approximation of the $\ell_0$-norm. In [39], projections of streaming data onto
hyperslabs and weighted $\ell_1$ balls are used instead of minimizing
regularized costs recursively. Proximal forward-backward
splitting is considered in an adaptive scenario in [40]. In the
context of distributed learning over single-task networks, diffusion
LMS methods promoting sparsity have been proposed.
Sparse diffusion LMS strategies using subgradient methods
are proposed in [41], and strategies using proximal methods are
proposed in [44]-[46]. In [47], the authors employ projection-based
techniques [39] to derive distributed diffusion algorithms
promoting sparsity, and in [48] a diffusion LMS algorithm for
estimating an $s$-sparse vector is proposed based on adaptive
greedy techniques similar to [49]. These techniques estimate
the positions of non-zero entries in the target vector, and then
perform computations on this subset. More generally, diffusion
strategies based on proximal gradient for minimizing general
costs (not necessarily mean-square error costs) and subject to
a broader class of constraints on the parameter vector to be
estimated (including sparsity) are derived in [46].

Our purpose is to derive an adaptive learning algorithm
over multitask networks where optimum parameter vectors of
neighboring clusters share a large number of similar entries
and a relatively small number of distinct entries. Consider
nodes $k$ and $\ell$ of neighboring clusters $\mathcal{C}(k)$ and $\mathcal{C}(\ell)$, and let
$\delta_{k,\ell}$ denote the vector difference $w_{\mathcal{C}(k)} - w_{\mathcal{C}(\ell)}$. Promoting the
sparsity of $\delta_{k,\ell}$ can be performed by considering the pseudo
$\ell_0$-norm of $\delta_{k,\ell}$ as it denotes the number of nonzero entries.
Nevertheless, $\|\delta_{k,\ell}\|_0$ is a non-convex co-regularizer that leads
to computational challenges. A common alternative is to use
the $\ell_1$-norm regularization function defined as

$$f_1(\delta_{k,\ell}) = \|\delta_{k,\ell}\|_1 = \sum_{m=1}^{M} \big|[\delta_{k,\ell}]_m\big|. \qquad (5)$$


Since the $\ell_1$-norm uniformly shrinks all the components of a
vector and does not distinguish between zero and non-zero
entries, it is common in the sparse adaptive filtering
framework to consider a weighted formulation of the $\ell_1$-norm.
The weighted $\ell_1$-norm was designed to reduce the bias induced
by the $\ell_1$-norm and enhance the penalization of the non-zero
entries of a vector. Given the weight vector
$\alpha_{k\ell} = [\alpha_{k\ell}^1, \ldots, \alpha_{k\ell}^M]^\top$
with $\alpha_{k\ell}^m > 0$ for all $m$, the weighted $\ell_1$-norm is defined as:

$$f_2(\delta_{k,\ell}) = \sum_{m=1}^{M} \alpha_{k\ell}^m \big|[\delta_{k,\ell}]_m\big|. \qquad (6)$$

The weights are usually chosen as:

$$\alpha_{k\ell}^m = \frac{1}{\epsilon + \big|[\delta_{k,\ell}^o]_m\big|}, \qquad m = 1, \ldots, M, \qquad (7)$$

where $\delta_{k,\ell}^o \triangleq w_{\mathcal{C}(k)}^o - w_{\mathcal{C}(\ell)}^o$. Since the optimum parameter vectors
are not available beforehand, we set

$$\alpha_{k\ell}^m(i) = \frac{1}{\epsilon + \big|[\delta_{k,\ell}(i-1)]_m\big|}, \qquad m = 1, \ldots, M, \qquad (8)$$

at each iteration $i$, where $\epsilon$ is a small constant to prevent
the denominator from vanishing and $\delta_{k,\ell}(i)$ is the estimate
of $\delta_{k,\ell}^o$ at nodes $k$ and $\ell$ and iteration $i$. This technique,
also known as reweighted $\ell_1$ minimization [50], is performed
at each iteration of the stochastic optimization process. It
has been shown in [50] that, by minimizing (6) with the
weights (8), one minimizes the log-sum penalty function
$\sum_{m=1}^{M} \log\big(\epsilon + |[\delta_{k,\ell}]_m|\big)$, which acts like the $\ell_0$-norm by
allowing a relatively large penalty to be placed on small
nonzero coefficients and more strongly encourages them to
be set to zero. In the sequel, we shall use $f(w_{\mathcal{C}(k)} - w_{\mathcal{C}(\ell)})$
to refer to the unweighted or reweighted $\ell_1$-norm promoting
the sparsity of $w_{\mathcal{C}(k)} - w_{\mathcal{C}(\ell)}$.

It is sufficient for this work to derive a distributed learning
algorithm of the LMS type. We shall therefore assume that the
local cost function $J_k(w_{\mathcal{C}(k)})$ at node $k$ is the mean-square
error criterion defined by:

$$J_k(w_{\mathcal{C}(k)}) = E\big\{|d_k(i) - x_k^\top(i)\, w_{\mathcal{C}(k)}|^2\big\}. \qquad (9)$$

Combining local mean-square-error cost functions and regularization
functions, the cooperative multitask estimation problem
is formulated as the problem of seeking a fully distributed
solution for solving:

$$\min_{w_{\mathcal{C}_1}, \ldots, w_{\mathcal{C}_Q}} \overline{J}^{\text{glob}}(w_{\mathcal{C}_1}, \ldots, w_{\mathcal{C}_Q}) = \min_{w_{\mathcal{C}_1}, \ldots, w_{\mathcal{C}_Q}} \sum_{k=1}^{N} J_k(w_{\mathcal{C}(k)}) + \eta \sum_{k=1}^{N} \sum_{\ell \in \mathcal{N}_k \setminus \mathcal{C}(k)} \rho_{k\ell}\, f(w_{\mathcal{C}(k)} - w_{\mathcal{C}(\ell)}), \qquad (10)$$

where $\eta > 0$ is the regularization strength used to enforce sparsity.
It ensures a tradeoff between fidelity to the measurements
and prior information on the relationships between tasks. The
weights $\rho_{k\ell} > 0$ aim at locally adjusting the regularization
strength. The notation $\mathcal{N}_k \setminus \mathcal{C}(k)$ denotes the set of neighboring
nodes of $k$ that are not in the same cluster as $k$.

Note that the regularization terms (5) and (6) are symmetric
with respect to the weight vectors $w_{\mathcal{C}(k)}$ and $w_{\mathcal{C}(\ell)}$, that is,
$f(w_{\mathcal{C}(k)} - w_{\mathcal{C}(\ell)}) = f(w_{\mathcal{C}(\ell)} - w_{\mathcal{C}(k)})$. Due to the summation
over the $N$ nodes, each term $f(w_{\mathcal{C}(k)} - w_{\mathcal{C}(\ell)})$ can be viewed
as weighted by $(\rho_{k\ell} + \rho_{\ell k})/2$ in (10). Problem (10) can therefore
be written in an alternative way as:

$$\min_{w_{\mathcal{C}_1}, \ldots, w_{\mathcal{C}_Q}} \overline{J}^{\text{glob}}(w_{\mathcal{C}_1}, \ldots, w_{\mathcal{C}_Q}) = \min_{w_{\mathcal{C}_1}, \ldots, w_{\mathcal{C}_Q}} \sum_{k=1}^{N} J_k(w_{\mathcal{C}(k)}) + \eta \sum_{k=1}^{N} \sum_{\ell \in \mathcal{N}_k \setminus \mathcal{C}(k)} p_{k\ell}\, f(w_{\mathcal{C}(k)} - w_{\mathcal{C}(\ell)}), \qquad (11)$$

where the factors $\{p_{k\ell}\}$ are symmetric, i.e., $p_{k\ell} = p_{\ell k}$, and
are given by:

$$p_{k\ell} \triangleq \frac{\rho_{k\ell} + \rho_{\ell k}}{2}. \qquad (12)$$


One way to avoid symmetrical regularization is to consider
an alternative problem formulation defined in terms of $Q$
Nash equilibrium problems as done in [29] with $\ell_2$-norm
co-regularizers. In this paper, we shall focus on problem (10).

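Let us consider the variable $w_{\mathcal{C}_j}$ of the $j$-th cluster. Given
$w_{\mathcal{C}(\ell)}$ with $\ell \in \mathcal{N}_k \setminus \mathcal{C}_j$ and $k \in \mathcal{C}_j$, the subdifferential of
$\overline{J}^{\text{glob}}(w_{\mathcal{C}_1}, \ldots, w_{\mathcal{C}_Q})$ in (11) with respect to $w_{\mathcal{C}_j}$ is given by:

$$\partial_{w_{\mathcal{C}_j}} \overline{J}^{\text{glob}}(w_{\mathcal{C}_1}, \ldots, w_{\mathcal{C}_Q}) = \sum_{k \in \mathcal{C}_j} \nabla_{w_{\mathcal{C}_j}} J_k(w_{\mathcal{C}_j}) + 2\eta \sum_{k \in \mathcal{C}_j} \sum_{\ell \in \mathcal{N}_k \setminus \mathcal{C}_j} p_{k\ell}\, \partial_{w_{\mathcal{C}_j}} f(w_{\mathcal{C}_j} - w_{\mathcal{C}(\ell)}), \qquad (13)$$

where we used the fact that the regularization terms (5), (6),
and the regularization factors $\{p_{k\ell}\}$ are symmetric. Since
we are interested in a distributed strategy for solving (10)
that relies only on in-network processing, we associate the
following regularized problem $(\mathcal{P}_j)$ with each cluster $\mathcal{C}_j$:

$$(\mathcal{P}_j): \quad \min_{w_{\mathcal{C}_j}} \overline{J}_{\mathcal{C}_j}(w_{\mathcal{C}_j}) = \min_{w_{\mathcal{C}_j}} \sum_{k \in \mathcal{C}_j} E\big\{|d_k(i) - x_k^\top(i)\, w_{\mathcal{C}_j}|^2\big\} + 2\eta \sum_{k \in \mathcal{C}_j} \sum_{\ell \in \mathcal{N}_k \setminus \mathcal{C}_j} p_{k\ell}\, f(w_{\mathcal{C}_j} - w_{\mathcal{C}(\ell)}). \qquad (14)$$

Given $w_{\mathcal{C}(\ell)}$ with $\ell \in \mathcal{N}_k \setminus \mathcal{C}_j$, note that the costs in
problems (10) and (14) have the same subdifferential relative
to $w_{\mathcal{C}_j}$. In order that each node can solve the problem in
an autonomous and adaptive manner using only local interactions,
we shall derive a distributed iterative algorithm for
solving (10) by considering (14) since both costs have the
same subdifferential information.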


B. Problem relaxation 

We shall now extend the derivations in 0, @, to 
handle multitask estimation problems with nondifferentiable 
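functions. In the sequel, we write $w_k$ instead of $w_{\mathcal{C}(k)}$ for
simplicity of notation. First, we associate with each node $k$
an unregularized local cost function $J_k^{\text{loc}}(\cdot)$ and a regularized
local cost function $\overline{J}_k^{\text{loc}}(\cdot)$ of the form:

$$J_k^{\text{loc}}(w_k) = \sum_{\ell \in \mathcal{N}_k \cap \mathcal{C}(k)} c_{\ell k}\, E\big\{|d_\ell(i) - x_\ell^\top(i)\, w_k|^2\big\}, \qquad (15)$$

$$\overline{J}_k^{\text{loc}}(w_k) = \sum_{\ell \in \mathcal{N}_k \cap \mathcal{C}(k)} c_{\ell k}\, E\big\{|d_\ell(i) - x_\ell^\top(i)\, w_k|^2\big\} + 2\eta \sum_{\ell \in \mathcal{N}_k \setminus \mathcal{C}(k)} p_{k\ell}\, f(w_k - w_\ell), \qquad (16)$$

where $\mathcal{N}_k \cap \mathcal{C}(k)$ denotes the set of nodes in the neighborhood
of node $k$ that belong to its cluster, and $\{c_{\ell k}\}$ are non-negative
weights satisfying

$$\sum_{k=1}^{N} c_{\ell k} = 1, \quad \text{and} \quad c_{\ell k} = 0 \ \text{if} \ k \notin \mathcal{N}_\ell \cap \mathcal{C}(\ell). \qquad (17)$$

Note that $w_k = w_\ell$ whenever $\ell \in \mathcal{N}_k \cap \mathcal{C}(k)$. Both costs
$J_k^{\text{loc}}(\cdot)$ and $\overline{J}_k^{\text{loc}}(\cdot)$ consist of a convex combination of mean-square
errors in the neighborhood of node $k$ but limited to its
cluster. In addition, expression (16) takes interactions among
neighboring clusters into account. Let us consider node $k$
belonging to cluster $\mathcal{C}_j$, i.e., $\mathcal{C}_j = \mathcal{C}(k)$. It can be checked
that $\overline{J}_{\mathcal{C}_j}(w_{\mathcal{C}_j})$ in (14) can be written as:

$$\overline{J}_{\mathcal{C}_j}(w_{\mathcal{C}_j}) = \sum_{\ell \in \mathcal{C}_j} \overline{J}_\ell^{\text{loc}}(w_\ell) = \overline{J}_k^{\text{loc}}(w_k) + \sum_{\ell \in \mathcal{C}_j \setminus \{k\}} \overline{J}_\ell^{\text{loc}}(w_\ell). \qquad (18)$$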


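The term $\sum_{\ell \in \mathcal{C}_j \setminus \{k\}} \overline{J}_\ell^{\text{loc}}(w_\ell)$ contains terms promoting relationships
between nodes $\ell \in \mathcal{C}_j \setminus \{k\}$ and their neighbors that
are outside $\mathcal{C}_j$ but not necessarily in the neighborhood of node
$k$. To limit these inter-cluster information exchanges to node $k$
and its extra-cluster neighbors, we relax $\sum_{\ell \in \mathcal{C}_j \setminus \{k\}} \overline{J}_\ell^{\text{loc}}(w_\ell)$ to
$\sum_{\ell \in \mathcal{C}_j \setminus \{k\}} J_\ell^{\text{loc}}(w_\ell)$. Since (15) is second-order differentiable,
a completion-of-squares argument shows that each
$J_\ell^{\text{loc}}(w_\ell)$ can be expressed as:

$$J_\ell^{\text{loc}}(w_\ell) = J_\ell^{\text{loc}}(w_\ell^o) + \|w_\ell - w_\ell^o\|_{R_\ell}^2, \qquad (19)$$

where the notation $\|x\|_\Sigma^2$ denotes $x^\top \Sigma x$ for any nonnegative
definite matrix $\Sigma$, $w_\ell^o$ is the minimizer of $J_\ell^{\text{loc}}(w_\ell)$, and $R_\ell$
is given by:

$$R_\ell = \sum_{k \in \mathcal{N}_\ell \cap \mathcal{C}(\ell)} c_{k\ell}\, R_{x,k}. \qquad (20)$$

Thus, using (16), (18), and (19), and dropping the constant
term $J_\ell^{\text{loc}}(w_\ell^o)$, we can replace the original cluster cost (14)
by the following cost function for cluster $\mathcal{C}(k)$ at node $k$:

$$\overline{J}'_{\mathcal{C}(k)}(w_k) = \sum_{\ell \in \mathcal{N}_k \cap \mathcal{C}(k)} c_{\ell k}\, E\big\{|d_\ell(i) - x_\ell^\top(i)\, w_k|^2\big\} + 2\eta \sum_{\ell \in \mathcal{N}_k \setminus \mathcal{C}(k)} p_{k\ell}\, f(w_k - w_\ell) + \sum_{\ell \in \mathcal{C}(k) \setminus \{k\}} \|w_k - w_\ell^o\|_{R_\ell}^2. \qquad (21)$$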

Equation (21) is an approximation relating the local cost
function $\overline{J}_k^{\text{loc}}(w_k)$ at node $k$ to the global cost function (14)
associated with the cluster $\mathcal{C}(k)$. Node $k$ cannot minimize (21)
directly since this cost still requires global information that
may not be available in its neighborhood. To avoid access to
information via multihop, we relax $\overline{J}'_{\mathcal{C}(k)}(w_k)$ by limiting the
sum in the third term on the RHS of (21) over the neighbors of
node $k$. In addition, since the covariance matrices $R_{x,\ell}$ may
not be known beforehand within the context of online learning,
a useful strategy proposed in the literature is to substitute the covariance
matrices $R_\ell$ by diagonal matrices of the form $b_{\ell k} I_M$, where
$b_{\ell k}$ are nonnegative coefficients that allow assigning different
weights to different neighbors. Later, these coefficients will
be incorporated into a left-stochastic matrix and the designer
does not need to worry about their selection. Based on the
arguments presented so far, the cluster cost function at each
node $k$ can be relaxed as follows:

$$\overline{J}''_{\mathcal{C}(k)}(w_k) = \sum_{\ell \in \mathcal{N}_k \cap \mathcal{C}(k)} c_{\ell k}\, E\big\{|d_\ell(i) - x_\ell^\top(i)\, w_k|^2\big\} + 2\eta \sum_{\ell \in \mathcal{N}_k \setminus \mathcal{C}(k)} p_{k\ell}\, f(w_k - w_\ell) + \sum_{\ell \in \mathcal{N}_k^- \cap \mathcal{C}(k)} b_{\ell k}\, \|w_k - w_\ell^o\|^2. \qquad (22)$$

Since this cost function only relies on data available in the 
neighborhood of each node k, we can now proceed to derive 
distributed strategies. 

The first and third terms on the RHS of (22) are second-order
differentiable and strictly convex. The second term is
convex but not continuously differentiable. In [31], a multitask
Adapt-then-Combine (ATC) diffusion algorithm was derived
using subgradient techniques. The purpose of this work is
to obtain an iterative algorithm for solving the convex minimization
problem (22) using a forward-backward splitting
approach.


C. Multitask diffusion with forward-backward splitting approach

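Let $w_k(i)$ denote the estimate of $w_k^o$ at node $k$ and
iteration $i$. Considering a forward-backward splitting strategy
for solving (22), we have:

$$w_k(i+1) = \mathrm{prox}_{2\eta\nu_k g_{k,i}}\big(w_k(i) - \nu_k \nabla_{w_k} J''_{\mathcal{C}(k)}(w_k(i))\big), \qquad (23)$$

with $\nu_k$ a positive step-size parameter,

$$g_{k,i}(w_k) \triangleq \sum_{\ell \in \mathcal{N}_k \setminus \mathcal{C}(k)} p_{k\ell}\, f(w_k - w_\ell(i)), \qquad (24)$$

and $J''_{\mathcal{C}(k)}(w_k)$ denoting the unregularized part of $\overline{J}''_{\mathcal{C}(k)}(w_k)$
limited to the first and third terms on the RHS of (22). Let

$$\phi_k(i+1) = w_k(i) - \nu_k \nabla_{w_k} J''_{\mathcal{C}(k)}(w_k(i)). \qquad (25)$$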

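Node $k$ can run the Adapt-then-Combine (ATC) form of
diffusion for evaluating $\phi_k(i+1)$. Thus, we arrive at the
following Adapt-then-Combine (ATC) diffusion strategy with
forward-backward splitting for solving problem (10) in a fully
distributed adaptive manner:

$$\begin{cases}
\psi_k(i+1) = w_k(i) + \mu_k \displaystyle\sum_{\ell \in \mathcal{N}_k \cap \mathcal{C}(k)} c_{\ell k}\, x_\ell(i)\big[d_\ell(i) - x_\ell^\top(i)\, w_k(i)\big], \\
\phi_k(i+1) = \displaystyle\sum_{\ell \in \mathcal{N}_k \cap \mathcal{C}(k)} a_{\ell k}\, \psi_\ell(i+1), \\
w_k(i+1) = \mathrm{prox}_{\eta\mu_k g_{k,i+1}}\big(\phi_k(i+1)\big),
\end{cases} \qquad (26)$$

where $\mu_k = 2\nu_k$ is introduced to avoid an extra factor of
2 multiplying $\nu_k$ and coming from evaluating the gradient
of squared quantities in $J''_{\mathcal{C}(k)}(w_k)$, $\{a_{\ell k}\}$ are nonnegative
combination coefficients satisfying:

$$\sum_{\ell=1}^{N} a_{\ell k} = 1, \quad \text{and} \quad a_{\ell k} = 0 \ \text{if} \ \ell \notin \mathcal{N}_k \cap \mathcal{C}(k), \qquad (27)$$

and

$$g_{k,i+1}(w_k) \triangleq \sum_{\ell \in \mathcal{N}_k \setminus \mathcal{C}(k)} p_{k\ell}\, f(w_k - \phi_\ell(i+1)). \qquad (28)$$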


Functions $g_{k,i}(\cdot)$ in (24) and $g_{k,i+1}(\cdot)$ in (28) are iteration
dependent through $w_\ell(i)$ and $\phi_\ell(i+1)$. Note that we have
substituted $w_\ell(i)$ in (24) by $\phi_\ell(i+1)$ in (28) since $\phi_\ell(i+1)$
is an updated estimate of $w_\ell(i)$ at node $\ell$. The proximal
operator of $\eta\mu_k g_{k,i+1}(\cdot)$ in the third step of (26) needs to
be evaluated at each iteration $i+1$ and for all nodes $k$ in the
network. A closed-form expression is recommended to achieve
higher computational efficiency. We shall derive such a closed-form
expression when $f$ in (28) is selected either as the $\ell_1$-norm
or the reweighted $\ell_1$-norm (see Sec. II-D for details).

The multitask diffusion LMS with forward-backward
splitting starts with an initial estimate $w_k(0)$ for all $k$, and
repeats (26) at each instant $i \geq 0$ and for all $k$. In the
first step of (26), which corresponds to the adaptation step,
node $k$ receives from its intra-cluster neighbors their raw
data $\{d_\ell(i), x_\ell(i)\}$, combines this information through the
coefficients $\{c_{\ell k}\}$, and uses it to update its estimate $w_k(i)$ to
an intermediate estimate $\psi_k(i+1)$. The second step in (26) is
a combination step where node $k$ receives the intermediate
estimates $\{\psi_\ell(i+1)\}$ from its intra-cluster neighbors and
combines them through the coefficients $\{a_{\ell k}\}$ to obtain the
intermediate value $\phi_k(i+1)$. Finally, in the third step in (26),
node $k$ receives the intermediate estimates $\{\phi_\ell(i+1)\}$ from
its neighbors that are outside its cluster and evaluates the
proximal operator of the function in (28) at $\phi_k(i+1)$ to
obtain $w_k(i+1)$. To run the algorithm, each node $k$ only
needs to know the step-size $\mu_k$, the regularization strength
$\eta$, the regularization weights $\{p_{k\ell}\}_{\ell \in \mathcal{N}_k \setminus \mathcal{C}(k)}$, and the coefficients
$\{a_{\ell k}, c_{\ell k}\}_{\ell \in \mathcal{N}_k \cap \mathcal{C}(k)}$ satisfying conditions (17) and (27).
The scalars $\{a_{\ell k}, c_{\ell k}\}$ and $\{p_{k\ell}\}$ correspond to weighting
coefficients over the edges linking node $k$ to its neighbors $\ell$
according to whether these neighbors lie inside or outside its
cluster. There are several ways to select these coefficients.
Later, we propose an adaptive rule for
selecting each regularization weight $p_{k\ell}$ based on a measure
of the sparsity level of $w_k^o - w_\ell^o$ at node $k$. Finally, note
that alternative implementations of (26) may be considered. In
particular, the adaptation step can be followed by the proximal
step, before or after aggregation as in the possible Adapt-then-Combine
and Combine-then-Adapt diffusion strategies.
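The three steps of (26) can be sketched as follows. This is a minimal, centralized simulation of one network iteration written under stated assumptions, not the authors' implementation: the function name `atc_fb_step`, the list-of-lists data layout, and the `prox` callback (which stands in for the proximal operator of $\eta\mu_k g_{k,i+1}$ and receives all intermediate estimates $\phi_\ell(i+1)$) are hypothetical. Cluster membership is encoded implicitly through the zero pattern of the coefficients `c` and `a`, as required by (17) and (27).

```python
def atc_fb_step(w, x, d, c, a, mu, prox):
    """One iteration of (26) for all nodes.

    w[k]   : current estimate w_k(i) (list of M floats)
    x[l]   : regressor x_l(i); d[l]: measurement d_l(i)
    c[l][k]: adaptation weights c_{lk}; a[l][k]: combination weights a_{lk}
    mu[k]  : step-size of node k
    prox   : callable (k, phi) -> w_k(i+1); hypothetical stand-in for
             the proximal operator of eta*mu_k*g_{k,i+1}
    """
    n, m = len(w), len(w[0])
    # Step 1 (adaptation): psi_k = w_k + mu_k * sum_l c_lk x_l (d_l - x_l^T w_k)
    psi = []
    for k in range(n):
        upd = list(w[k])
        for l in range(n):
            if c[l][k] == 0.0:
                continue  # l is not an intra-cluster neighbor of k
            err = d[l] - sum(xi * wi for xi, wi in zip(x[l], w[k]))
            for j in range(m):
                upd[j] += mu[k] * c[l][k] * x[l][j] * err
        psi.append(upd)
    # Step 2 (combination): phi_k = sum_l a_lk psi_l
    phi = [[sum(a[l][k] * psi[l][j] for l in range(n)) for j in range(m)]
           for k in range(n)]
    # Step 3 (proximal step): uses phi of the extra-cluster neighbors
    return [prox(k, phi) for k in range(n)]
```

Passing `prox=lambda k, phi: phi[k]` corresponds to $\eta = 0$, in which case the recursion reduces to a diffusion LMS run independently within each cluster.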

Algorithm (26) may be applied to multitask problems
involving any type of co-regularizers $f(\cdot)$ provided that the
proximal operator of a weighted sum of these regularizers can
be assessed in closed form. In the next section, we shall focus
on the particular case of sparsity promoting regularizers.


D. Proximal operator of weighted sum of $\ell_1$-norms

We shall now derive a closed-form expression for the
proximal operator of the convex function $g_{k,i+1}(w_k)$ in (28).
Considering both regularizations addressed in this work, that
is, the $\ell_1$-norm (5) and the reweighted $\ell_1$-norm (6), we write:

$$g_{k,i+1}(w_k) = \sum_{\ell \in \mathcal{N}_k \setminus \mathcal{C}(k)} p_{k\ell} \sum_{m=1}^{M} \alpha_{k\ell}^m(i)\, \big|[w_k]_m - [\phi_\ell(i+1)]_m\big| = \sum_{m=1}^{M} \Phi_{k,m,i+1}([w_k]_m), \qquad (29)$$

where $\Phi_{k,m,i+1}([w_k]_m)$ is the iteration-dependent function
given by:

$$\Phi_{k,m,i+1}([w_k]_m) \triangleq \sum_{\ell \in \mathcal{N}_k \setminus \mathcal{C}(k)} p_{k\ell}\, \alpha_{k\ell}^m(i)\, \big|[w_k]_m - [\phi_\ell(i+1)]_m\big|. \qquad (30)$$

Since $g_{k,i+1}(w_k)$ is fully separable, its proximal operator can
be evaluated component-wise:

$$\big[\mathrm{prox}_{\eta\mu_k g_{k,i+1}}(\phi_k(i+1))\big]_m = \mathrm{prox}_{\eta\mu_k \Phi_{k,m,i+1}}\big([\phi_k(i+1)]_m\big), \qquad \forall m = 1, \ldots, M. \qquad (31)$$

For clarity of presentation, we shall now derive the proximal
operator of a function $h(\cdot)$ similar to $\Phi_{k,m,i+1}$. Next, we shall
establish the closed-form expression for $\mathrm{prox}_{\eta\mu_k \Phi_{k,m,i+1}}(\cdot)$ by
identification.

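Let $h: \mathbb{R} \rightarrow \mathbb{R}$ be a combination of absolute value functions
defined as:

$$h(x) \triangleq \sum_{j=1}^{J} c_j\, h_j(x) = \sum_{j=1}^{J} c_j\, |x - b_j| \qquad (32)$$

with $c_j > 0$ for all $j$ and $b_1 < b_2 < \cdots < b_J$. Note that
this ordering is assumed for convenience of derivation and
does not affect the final result. Iterative algorithms have been
proposed in the literature for evaluating the proximal operator
of sums of composite functions. We are, however,
able to derive a closed-form expression for the proximal operator of (32) as detailed in
the sequel. From the optimality condition for (3), namely that
zero belongs to the subgradient set at the minimizer $\mathrm{prox}_{\lambda h}(v)$,
we have:

$$0 \in \partial h\big(\mathrm{prox}_{\lambda h}(v)\big) + \frac{1}{\lambda}\big(\mathrm{prox}_{\lambda h}(v) - v\big) \iff v - \mathrm{prox}_{\lambda h}(v) \in \lambda\, \partial h\big(\mathrm{prox}_{\lambda h}(v)\big). \qquad (33)$$

Since $x \in \mathbb{R}$ and $c_j$ are non-negative, we have [54, Chapter 5, Lemma 10]:

$$\partial\Big(\sum_{j=1}^{J} c_j\, h_j(x)\Big) = \sum_{j=1}^{J} c_j\, \partial h_j(x) = \sum_{j=1}^{J} c_j\, \partial |x - b_j|. \qquad (34)$$

Hence, the subdifferential of the real-valued convex function
$h(x)$ in (32) is:

$$\partial h(x) = \begin{cases}
-\sum_{j=1}^{J} c_j, & \text{if } x < b_1, \\
c_1 \cdot [-1, 1] - \sum_{j=2}^{J} c_j, & \text{if } x = b_1, \\
c_1 - \sum_{j=2}^{J} c_j, & \text{if } b_1 < x < b_2, \\
\qquad \vdots \\
\sum_{j=1}^{J-1} c_j + c_J \cdot [-1, 1], & \text{if } x = b_J, \\
\sum_{j=1}^{J} c_j, & \text{if } x > b_J.
\end{cases} \qquad (35)$$

From (33) and (35), extensive but routine calculations lead
to the following implementation for evaluating the proximal
operator of $h$ in (32). Let us decompose $\mathbb{R}$ into $J+1$ intervals
such that $\mathbb{R} = \bigcup_{n=0}^{J} \mathcal{I}_n$ where, as illustrated in Fig. 1:

$$\mathcal{I}_0 \triangleq \Big(-\infty,\; b_1 - \lambda \sum_{j=1}^{J} c_j\Big), \qquad (36)$$

$$\mathcal{I}_n \triangleq \mathcal{I}_{n,1} \cup \mathcal{I}_{n,2}, \qquad n = 1, \ldots, J, \qquad (37)$$

with

$$\mathcal{I}_{n,1} \triangleq \Big[b_n + \lambda\Big(\sum_{j=1}^{n-1} c_j - \sum_{j=n}^{J} c_j\Big),\; b_n + \lambda\Big(\sum_{j=1}^{n} c_j - \sum_{j=n+1}^{J} c_j\Big)\Big], \qquad n = 1, \ldots, J, \qquad (38)$$

$$\mathcal{I}_{n,2} \triangleq \Big(b_n + \lambda\Big(\sum_{j=1}^{n} c_j - \sum_{j=n+1}^{J} c_j\Big),\; b_{n+1} + \lambda\Big(\sum_{j=1}^{n} c_j - \sum_{j=n+1}^{J} c_j\Big)\Big), \qquad n = 1, \ldots, J-1, \qquad (39)$$

$$\mathcal{I}_{J,2} \triangleq \Big(b_J + \lambda \sum_{j=1}^{J} c_j,\; +\infty\Big). \qquad (40)$$

Fig. 1. Decomposition of $\mathbb{R}$ into $J+1$ intervals given by (36)-(40). The width of the intervals depends on the weights $\{c_j\}_{j=1}^{J}$ and on the coefficients

Depending on the interval to which $v$ belongs, we evaluate
the proximal operator according to:

$$\mathrm{prox}_{\lambda h}(v) = \begin{cases}
v + \lambda \sum_{j=1}^{J} c_j, & \text{if } v \in \mathcal{I}_0, \\
b_n, & \text{if } v \in \mathcal{I}_{n,1}, \\
v + \lambda\Big(\sum_{j=n+1}^{J} c_j - \sum_{j=1}^{n} c_j\Big), & \text{if } v \in \mathcal{I}_{n,2}.
\end{cases} \qquad (41)$$

In order to make clearer how the operator in (41) works, we
plot $\mathrm{prox}_{\lambda h}(v)$ for three expressions of $h$ in Fig. 2.

Fig. 2. Proximal operator $\mathrm{prox}_{\lambda h}(v)$ versus $v \in \mathbb{R}$ with $\lambda = 1$ and $h: \mathbb{R} \rightarrow \mathbb{R}$, $h(x) = \sum_{j=1}^{J} c_j |x - b_j|$.

It can be checked that the proximal operator in (41) can be
written more compactly as:

$$\mathrm{prox}_{\lambda h}(v) = v - \lambda\, T(v), \qquad (42)$$

where

$$T(v) \triangleq \frac{1}{2} \sum_{n=1}^{J} \Big\{ \Big|\frac{v - b_n}{\lambda} - \sum_{j=1}^{n-1} c_j + \sum_{j=n}^{J} c_j\Big| - \Big|\frac{v - b_n}{\lambda} - \sum_{j=1}^{n} c_j + \sum_{j=n+1}^{J} c_j\Big| \Big\}. \qquad (43)$$

Comparing (33) and (42), we remark that $T(v)$ is a subgradient
of $h$ at $\mathrm{prox}_{\lambda h}(v)$. Based on equation (35), $T(v)$ is bounded
as follows:

$$|T(v)| \leq \sum_{j=1}^{J} c_j \qquad (44)$$

for all $v$. In fact, equality holds when $v$ belongs to $\mathcal{I}_0$ in (36)
or $\mathcal{I}_{J,2}$ in (40). When $v$ belongs to an interval of the form of
$\mathcal{I}_{n,1}$ in (38), we have:

$$T(v) = \frac{v - b_n}{\lambda} \in \Big[\sum_{j=1}^{n-1} c_j - \sum_{j=n}^{J} c_j,\; \sum_{j=1}^{n} c_j - \sum_{j=n+1}^{J} c_j\Big] \subseteq \Big[-\sum_{j=1}^{J} c_j,\; \sum_{j=1}^{J} c_j\Big], \qquad (45)$$

and when it belongs to an interval of the form of $\mathcal{I}_{n,2}$ in (39),
we have:

$$T(v) = \sum_{j=1}^{n} c_j - \sum_{j=n+1}^{J} c_j \in \Big(-\sum_{j=1}^{J} c_j,\; \sum_{j=1}^{J} c_j\Big). \qquad (46)$$

We note that the upper bound in (44) is independent of $\lambda$.
Using (42), the $m$-th entry of $\mathrm{prox}_{\eta\mu_k g_{k,i+1}}(\phi_k(i+1))$ in (31)
can be written as:

$$\big[\mathrm{prox}_{\eta\mu_k g_{k,i+1}}(\phi_k(i+1))\big]_m = [\phi_k(i+1)]_m - \eta\mu_k\, T_{k,m,i+1}\big([\phi_k(i+1)]_m\big). \qquad (47)$$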

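Note that $T_{k,m,i+1}([\phi_k(i+1)]_m)$ is a function of the form (43)
where, based on (30), the coefficients $b_j$ and $c_j$ are given by
$[\phi_\ell(i+1)]_m$ and $p_{k\ell}\,\alpha_{k\ell}^m(i)$, respectively, and the scalar $v$
corresponds to the $m$-th component of the vector $\phi_k(i+1)$.
Using the boundedness of $T_{k,m,i+1}(\cdot)$ in (44), we obtain:

$$\big|T_{k,m,i+1}([\phi_k(i+1)]_m)\big| \leq \sum_{\ell \in \mathcal{N}_k \setminus \mathcal{C}(k)} p_{k\ell}\, \alpha_{k\ell}^m(i) \triangleq s_k^m(i) \qquad (48)$$

for all $[\phi_k(i+1)]_m$. For the $\ell_1$-norm (5), we have:

$$s_k^m(i) = s_k \triangleq \sum_{\ell \in \mathcal{N}_k \setminus \mathcal{C}(k)} p_{k\ell}, \qquad (49)$$

for all $i$ and $m = 1, \ldots, M$. For the reweighted $\ell_1$-norm (6),
we have:

$$s_k^m(i) = \sum_{\ell \in \mathcal{N}_k \setminus \mathcal{C}(k)} \frac{p_{k\ell}}{\epsilon + \big|[\delta_{k,\ell}(i-1)]_m\big|} \leq \sum_{\ell \in \mathcal{N}_k \setminus \mathcal{C}(k)} \frac{p_{k\ell}}{\epsilon} = \frac{s_k}{\epsilon}, \qquad (50)$$

for all $i$ and $m = 1, \ldots, M$. Using (47), the proximal operator
of $\eta\mu_k g_{k,i+1}$ can be written as:

$$\mathrm{prox}_{\eta\mu_k g_{k,i+1}}(\phi_k(i+1)) = \phi_k(i+1) - \eta\mu_k\, T_{k,i+1}(\phi_k(i+1)), \qquad (51)$$

where $T_{k,i+1}(\phi_k(i+1))$ is the $M \times 1$ vector given by:

$$T_{k,i+1}(\phi_k(i+1)) \triangleq \mathrm{col}\big\{T_{k,1,i+1}([\phi_k(i+1)]_1), \ldots, T_{k,M,i+1}([\phi_k(i+1)]_M)\big\}. \qquad (52)$$

As a consequence, the $\ell_2$-norm of the vector $T_{k,i+1}(\cdot)$ can be
bounded as:

$$\|T_{k,i+1}(\cdot)\|_2 \leq s_k \sqrt{M}, \quad \text{for the } \ell_1\text{-norm}, \qquad (53)$$

$$\|T_{k,i+1}(\cdot)\|_2 \leq \frac{s_k \sqrt{M}}{\epsilon}, \quad \text{for the reweighted } \ell_1\text{-norm}. \qquad (54)$$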
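The closed-form operator of Section II-D can be sketched in a few lines. The code below evaluates $T(v)$ as in (43) and the proximal operator as in (42) for a scalar $v$; the function names `T_of_v` and `prox_weighted_abs`, as well as the numerical values in the usage line, are illustrative assumptions. The lists `b` and `c` are assumed to hold the breakpoints $b_1 < \cdots < b_J$ and the positive weights $c_j$ in matching order.

```python
def T_of_v(v, b, c, lam):
    """Evaluate T(v) in (43) for h(x) = sum_j c[j] * |x - b[j]|."""
    total = 0.0
    for n in range(len(b)):
        u = (v - b[n]) / lam
        # sum_{j<n} c_j - sum_{j>=n} c_j  (left end of the n-th saturation range)
        lower = sum(c[:n]) - sum(c[n:])
        # sum_{j<=n} c_j - sum_{j>n} c_j  (right end of the n-th saturation range)
        upper = sum(c[:n + 1]) - sum(c[n + 1:])
        total += 0.5 * (abs(u - lower) - abs(u - upper))
    return total


def prox_weighted_abs(v, b, c, lam):
    """Closed-form proximal operator (42): prox_{lam*h}(v) = v - lam * T(v)."""
    return v - lam * T_of_v(v, b, c, lam)


# With a single term the operator reduces to soft-thresholding around b[0].
p = prox_weighted_abs(3.0, [1.0], [0.5], 1.0)
```

In the multitask algorithm, the same routine would be called entry-wise as in (47), with `b` built from the entries $[\phi_\ell(i+1)]_m$ of the extra-cluster neighbors (sorted in increasing order), `c` from the products $p_{k\ell}\alpha_{k\ell}^m(i)$, and `lam` set to $\eta\mu_k$.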


III. Stability analysis 
A. Error vector recursion 

We shall now analyze the stability of the multitask diffusion 
algorithm ( |26l l in the mean and mean-square-error sense. We 
first define at node k and iteration i the weight error vector 
Wk{i) = — Wk{i) and the intermediate error vector 

ipki't) — '^k ~ 4>k{'^)- Furthermore, we introduce the network 
vectors: 


w{l) 

= col . 


(55) 


= col{^i(i),.. 


(56) 


= col|^i(i),. 


(57) 
























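Let $\mathcal{M}$ and $\mathcal{R}_x(i)$ be the $MN \times MN$ block diagonal matrices
defined as:

$\mathcal{M} \triangleq \mathrm{diag}\{\mu_k I_M\}_{k=1}^{N}$    (58)

$\mathcal{R}_x(i) \triangleq \mathrm{diag}\Big\{\sum_{\ell \in \mathcal{N}_k \cap \mathcal{C}(k)} c_{\ell k}\, x_\ell(i) x_\ell^\top(i)\Big\}_{k=1}^{N}$    (59)

and $p_{zx}(i)$ be the $MN \times 1$ block vector defined as:

$p_{zx}(i) \triangleq \mathcal{C}^\top \mathrm{col}\{x_k(i) z_k(i)\}_{k=1}^{N}$    (60)

where $\mathcal{C} \triangleq C \otimes I_M$ and $C$ is the $N \times N$ right-stochastic
matrix whose $\ell k$-th entry is $c_{\ell k}$. Let $\mathcal{A} \triangleq A \otimes I_M$ where $A$
is the $N \times N$ left-stochastic matrix whose $\ell k$-th entry is $a_{\ell k}$.
Subtracting $w_k^o$ from both sides of the first and second step
in (26), and using the linear data model, we obtain:

$\tilde{\phi}(i+1) = \mathcal{A}^\top [I_{MN} - \mathcal{M}\mathcal{R}_x(i)]\,\tilde{w}(i) - \mathcal{A}^\top \mathcal{M}\, p_{zx}(i)$.    (61)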

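Subtracting $w_k^o$ from both sides of the third step in (26), and
using result (51), we get:

$\tilde{w}_k(i+1) = \tilde{\phi}_k(i+1) + \eta\mu_k\,\Upsilon_{k,i+1}(\phi_k(i+1))$.    (62)

Hence, the network error vector for the diffusion strategy (26)
evolves according to the following recursion:

$\tilde{w}(i+1) = \mathcal{A}^\top [I_{MN} - \mathcal{M}\mathcal{R}_x(i)]\,\tilde{w}(i) - \mathcal{A}^\top \mathcal{M}\, p_{zx}(i) + \eta\mathcal{M}\,\Upsilon_{i+1}(\phi(i+1))$,    (63)

where $\Upsilon_{i+1}(\phi(i+1))$ is the $N \times 1$ block vector whose $k$-th
block is given by (52), namely,

$\Upsilon_{i+1}(\phi(i+1)) = \mathrm{col}\{\Upsilon_{k,i+1}(\phi_k(i+1))\}_{k=1}^{N}$.    (64)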

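In order to make the presentation clearer, we shall use the
following notation for terms in recursion (63):

$\mathcal{B}(i) \triangleq \mathcal{A}^\top [I_{MN} - \mathcal{M}\mathcal{R}_x(i)]$,    (65)

$g(i) \triangleq \mathcal{A}^\top \mathcal{M}\, p_{zx}(i)$,    (66)

$r(i+1) \triangleq \eta\mathcal{M}\,\Upsilon_{i+1}(\phi(i+1))$.    (67)

Hence, recursion (63) can be rewritten as follows:

$\tilde{w}(i+1) = \mathcal{B}(i)\,\tilde{w}(i) - g(i) + r(i+1)$.    (68)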


B. Mean behavior analysis 

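Taking the expectation of both sides of (68), using Assumption 1
and $\mathbb{E}\{p_{zx}(i)\} = 0$, we obtain that the mean error
vector evolves according to the following recursion:

$\mathbb{E}\{\tilde{w}(i+1)\} = \mathcal{B}\,\mathbb{E}\{\tilde{w}(i)\} + \mathbb{E}\{r(i+1)\}$,    (69)

where

$\mathcal{B} \triangleq \mathcal{A}^\top (I_{MN} - \mathcal{M}\mathcal{R}_x)$,    (70)

$\mathcal{R}_x \triangleq \mathbb{E}\{\mathcal{R}_x(i)\} = \mathrm{diag}\Big\{\sum_{\ell \in \mathcal{N}_k \cap \mathcal{C}(k)} c_{\ell k} R_{x,\ell}\Big\}_{k=1}^{N}$,    (71)

$\mathbb{E}\{r(i+1)\} = \eta\mathcal{M}\,\mathbb{E}\{\Upsilon_{i+1}(\phi(i+1))\}$,    (72)

with $R_{x,\ell} = \mathbb{E}\{x_\ell(i) x_\ell^\top(i)\}$.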


The following theorem guarantees the mean stability of the
multitask diffusion LMS (26) with forward-backward splitting.

Recall that the block maximum norm of an $N \times 1$ block
vector $x = \mathrm{col}\{x_k\}_{k=1}^{N}$ and the induced block maximum norm
of an $N \times N$ block matrix $X$ are defined as:

$\|x\|_{b,\infty} \triangleq \max_{1 \le k \le N} \|x_k\|_2, \qquad \|X\|_{b,\infty} \triangleq \max_{x} \frac{\|Xx\|_{b,\infty}}{\|x\|_{b,\infty}}$.    (73)


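Theorem 1. (Stability in the mean) Assume that the data model
and Assumption 1 hold. Then, for any initial conditions,
the multitask diffusion strategy (26) converges in the mean
to a small bounded region of the order of $\mu_{\max}$, i.e.,
$\lim_{i \to \infty} \|\mathbb{E}\{\tilde{w}(i)\}\|_{b,\infty} = O(\mu_{\max})$, if the step-sizes are
chosen such that:

$0 < \mu_k < \frac{2}{\lambda_{\max}\big(\sum_{\ell \in \mathcal{N}_k \cap \mathcal{C}(k)} c_{\ell k} R_{x,\ell}\big)}, \qquad k = 1, \ldots, N,$    (74)

where $\mu_{\max} \triangleq \max_{1 \le k \le N} \mu_k$ and $\lambda_{\max}(\cdot)$ is the maximum
eigenvalue of its matrix argument. The block maximum norm
of the bias can be upper bounded as:

$\lim_{i \to \infty} \|\mathbb{E}\{\tilde{w}(i)\}\|_{b,\infty} \le \frac{\eta\,\mu_{\max}\, s_{\max}\sqrt{M}}{1 - \|\mathcal{B}\|_{b,\infty}}$,    (75)

$\lim_{i \to \infty} \|\mathbb{E}\{\tilde{w}(i)\}\|_{b,\infty} \le \frac{1}{\epsilon} \cdot \frac{\eta\,\mu_{\max}\, s_{\max}\sqrt{M}}{1 - \|\mathcal{B}\|_{b,\infty}}$,    (76)

for the $\ell_1$-norm and the reweighted $\ell_1$-norm, respectively.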
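As a sanity check on condition (74), the following sketch builds the matrix $\mathcal{B}$ of (70) for a toy single-cluster network and compares its spectral radius for step-sizes inside and outside the bound. The helper names (`mean_recursion_matrix`, `spectral_radius`), the three-node topology, and the regressor variances are assumptions chosen for illustration; they are not the settings used in the paper.

```python
import numpy as np


def mean_recursion_matrix(A, C, R_list, mu):
    """Return B = A^T (I - M R_x) as in (70) and the step-size bounds of (74).

    A[l, k] = a_{lk}, C[l, k] = c_{lk} (zero outside N_k ∩ C(k)),
    R_list[l] = R_{x,l}, mu[k] = mu_k.
    """
    N, M = A.shape[0], R_list[0].shape[0]
    # k-th diagonal block of R_x: sum over l of c_{lk} R_{x,l}
    blocks = [sum(C[l, k] * R_list[l] for l in range(N)) for k in range(N)]
    calR = np.zeros((N * M, N * M))
    for k, Rk in enumerate(blocks):
        calR[k * M:(k + 1) * M, k * M:(k + 1) * M] = Rk
    calA = np.kron(A, np.eye(M))
    calM = np.kron(np.diag(mu), np.eye(M))
    B = calA.T @ (np.eye(N * M) - calM @ calR)
    mu_bound = np.array([2.0 / np.linalg.eigvalsh(Rk).max() for Rk in blocks])
    return B, mu_bound


def spectral_radius(X):
    return float(np.abs(np.linalg.eigvals(X)).max())


# Toy example: 3 fully connected nodes in one cluster, uniform weights.
N, M = 3, 2
A = C = np.full((N, N), 1.0 / N)
R_list = [s2 * np.eye(M) for s2 in (1.0, 2.0, 0.5)]

_, mu_bound = mean_recursion_matrix(A, C, R_list, np.ones(N))
B_ok, _ = mean_recursion_matrix(A, C, R_list, 0.25 * mu_bound)
B_bad, _ = mean_recursion_matrix(A, C, R_list, 1.5 * mu_bound)
rho_ok, rho_bad = spectral_radius(B_ok), spectral_radius(B_bad)
```

Condition (74) is only sufficient, so exceeding it does not always destabilize $\mathcal{B}$; in this particular example it does, which makes the contrast between `rho_ok` and `rho_bad` easy to see.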


Before proceeding, let us introduce the following assumptions
on the regression data and step-sizes.

Assumption 1. (Independent regressors) The regression vectors
$x_k(i)$ arise from a zero-mean random process that is
temporally white and spatially independent.

It follows that $x_k(i)$ is independent of $w_\ell(j)$ for $i \ge j$
and for all $\ell$. This assumption is commonly used in adaptive
filtering since it helps simplify the analysis. Furthermore,
performance results obtained under this assumption match well
the actual performance of stand-alone filters for sufficiently
small step-sizes [55].

Assumption 2. (Small step-sizes) The step-sizes $\mu_k$ are sufficiently
small so that terms that depend on higher-order powers
of the step-sizes can be ignored.


Proof: Iterating (69) starting from $i = 0$, we arrive at the
following expression:

$\mathbb{E}\{\tilde{w}(i+1)\} = \mathcal{B}^{i+1}\,\mathbb{E}\{\tilde{w}(0)\} + \sum_{j=0}^{i} \mathcal{B}^{j}\,\mathbb{E}\{r(i+1-j)\}$,    (77)

where $\mathbb{E}\{\tilde{w}(0)\}$ is the initial condition. $\mathbb{E}\{\tilde{w}(i+1)\}$ converges
when $i \to \infty$ if, and only if, both terms on the RHS of (77)
converge to finite values. The first term converges to zero as
$i \to \infty$ if the matrix $\mathcal{B}$ is stable. A sufficient condition to
ensure the stability of $\mathcal{B}$ is to choose the step-sizes according
to (74) (the proof follows the same arguments as Theorem 5.1 of
the cited reference). We shall now prove the convergence of
the second term on the RHS of (77). To prove the convergence
of the series $\sum_{j=0}^{i} \mathcal{B}^{j}\,\mathbb{E}\{r(i+1-j)\}$ as $i \to \infty$, it is sufficient to
prove that the series $\sum_{j=0}^{i} [\mathcal{B}^{j}\,\mathbb{E}\{r(i+1-j)\}]_k$ converges
for $k = 1, \ldots, MN$. A series is absolutely convergent if each
term of the series can be bounded by a term of an absolutely
convergent series [42]. Since the block maximum norm of a
block vector is greater than or equal to the largest absolute
value of its entries, each term $|[\mathcal{B}^{j}\,\mathbb{E}\{r(i+1-j)\}]_k|$ can be
bounded as:

$|[\mathcal{B}^{j}\,\mathbb{E}\{r(i+1-j)\}]_k| \le \|\mathcal{B}^{j}\|_{b,\infty} \cdot \|\mathbb{E}\{r(i+1-j)\}\|_{b,\infty} \le \|\mathcal{B}\|_{b,\infty}^{j}\, r_{\max}$.    (78)

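The quantity $\|\mathbb{E}\{r(i+1-j)\}\|_{b,\infty}$ is finite for all $i$ and $j$
and bounded by some constant $r_{\max} = O(\mu_{\max})$. In fact,
from (72), we have:

$\|\mathbb{E}\{r(i+1)\}\|_{b,\infty} \le \eta\,\mu_{\max}\,\|\mathbb{E}\{\Upsilon_{i+1}(\phi(i+1))\}\|_{b,\infty}$    (79)

since $\|\mathcal{M}\|_{b,\infty} = \mu_{\max}$. Using (53)-(54), the block maximum
norm of $\Upsilon_{i+1}(\phi(i+1))$ in (64) can be bounded as:

$\|\Upsilon_{i+1}(\phi(i+1))\|_{b,\infty} \le s_{\max}\sqrt{M}$    ($\ell_1$-norm)    (80)

$\|\Upsilon_{i+1}(\phi(i+1))\|_{b,\infty} \le \frac{s_{\max}\sqrt{M}}{\epsilon}$    (rew. $\ell_1$-norm)    (81)

for all $i$, where $s_{\max} \triangleq \max_{1 \le k \le N} s_k$. If the step-sizes are chosen
according to (74), the series $\sum_{j=0}^{\infty} \|\mathcal{B}\|_{b,\infty}^{j}\, r_{\max}$ is absolutely
convergent. Therefore, the series $\sum_{j} [\mathcal{B}^{j}\,\mathbb{E}\{r(i+1-j)\}]_k$
is an absolutely convergent series.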

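Note that when $i \to \infty$, the block maximum norm of the
bias can be bounded as

$\lim_{i \to \infty} \|\mathbb{E}\{\tilde{w}(i)\}\|_{b,\infty} = \lim_{i \to \infty} \Big\| \sum_{j=0}^{i} \mathcal{B}^{j}\,\mathbb{E}\{r(i+1-j)\} \Big\|_{b,\infty}$
$\le \lim_{i \to \infty} \sum_{j=0}^{i} \|\mathcal{B}^{j}\,\mathbb{E}\{r(i+1-j)\}\|_{b,\infty}$
$\le \lim_{i \to \infty} \sum_{j=0}^{i} \|\mathcal{B}\|_{b,\infty}^{j}\, r_{\max} = \frac{r_{\max}}{1 - \|\mathcal{B}\|_{b,\infty}}$.    (82)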
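The step from (82) to the bounds (75)-(76) can be spelled out by making the constant $r_{\max}$ explicit through (79)-(81). In the short worked step below, $\beta$ is shorthand introduced here for $\|\mathcal{B}\|_{b,\infty}$, and $\beta < 1$ is assumed to hold under (74), as in the stability argument cited in the proof.

```latex
\begin{align*}
r_{\max} &= \eta\,\mu_{\max}\,s_{\max}\sqrt{M}
  && (\ell_1\text{-norm, from (79) and (80)}),\\
r_{\max} &= \tfrac{1}{\epsilon}\,\eta\,\mu_{\max}\,s_{\max}\sqrt{M}
  && (\text{reweighted } \ell_1\text{-norm, from (79) and (81)}),\\
\sum_{j=0}^{\infty} \beta^{j}\, r_{\max} &= \frac{r_{\max}}{1-\beta},
  && 0 \le \beta < 1 .
\end{align*}
```

Substituting either value of $r_{\max}$ into the right-hand side of (82) gives (75) and (76), respectively.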


C. Mean-square-error stability 

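We examine the mean-square-error stability by studying the
convergence of the weighted variance $\mathbb{E}\{\|\tilde{w}(i)\|_{\Sigma}^{2}\}$, where $\Sigma$
is a positive semi-definite matrix that we are free to choose.
Evaluating the variance, we obtain:

$\mathbb{E}\{\|\tilde{w}(i+1)\|_{\Sigma}^{2}\} = \mathbb{E}\{\|\tilde{w}(i)\|_{\Sigma'}^{2}\} + \mathbb{E}\{\|g(i)\|_{\Sigma}^{2}\} + \varphi(r(i+1), \Sigma, \mathcal{B}(i), \tilde{w}(i), g(i))$,    (83)

where $\Sigma' = \mathbb{E}\{\mathcal{B}^\top(i)\,\Sigma\,\mathcal{B}(i)\}$ and

$\varphi(r(i+1), \Sigma, \mathcal{B}(i), \tilde{w}(i), g(i)) = \mathbb{E}\{\|r(i+1)\|_{\Sigma}^{2}\} + 2\,\mathbb{E}\{r^\top(i+1)\,\Sigma\,\mathcal{B}(i)\,\tilde{w}(i)\} - 2\,\mathbb{E}\{r^\top(i+1)\,\Sigma\,g(i)\}$    (84)

is a term coming from promoting relationships between clusters.
The last two terms on the RHS of (84) contain higher-order
powers of the step-sizes. Using Assumption 2, we get
the following approximation:

$\varphi(r(i+1), \Sigma, \tilde{w}(i)) \approx \mathbb{E}\{\|r(i+1)\|_{\Sigma}^{2}\} + 2\,\mathbb{E}\{r^\top(i+1)\,\Sigma\,\mathcal{B}\,\tilde{w}(i)\}$.    (85)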


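Let $\sigma = \mathrm{vec}(\Sigma)$ and $\sigma' = \mathrm{vec}(\Sigma')$ where the $\mathrm{vec}(\cdot)$ operator
stacks the columns of a matrix on top of each other. We will
use the notation $\|\tilde{w}\|_{\sigma}^{2}$ and $\|\tilde{w}\|_{\Sigma}^{2}$ interchangeably to denote
the same quantity $\tilde{w}^\top \Sigma\,\tilde{w}$. Using the property $\mathrm{vec}(U \Sigma W) =
(W^\top \otimes U)\,\mathrm{vec}(\Sigma)$, the relation between $\sigma'$ and $\sigma$ can be
expressed in the following form:

$\sigma' = \mathcal{F}\sigma$,    (86)

where $\mathcal{F}$ is the $(MN)^2 \times (MN)^2$ matrix given by:

$\mathcal{F} \triangleq \mathbb{E}\{\mathcal{B}^\top(i) \otimes \mathcal{B}^\top(i)\} \approx \mathcal{B}^\top \otimes \mathcal{B}^\top$.    (87)

The approximation in (87) is reasonable under Assumption 2.
Introducing the matrix $\mathcal{G}$:

$\mathcal{G} \triangleq \mathbb{E}\{g(i) g^\top(i)\} = \mathcal{A}^\top \mathcal{M}\,\mathcal{C}^\top \mathrm{diag}\{\sigma_{z,k}^{2} R_{x,k}\}_{k=1}^{N}\,\mathcal{C}\,\mathcal{M}\,\mathcal{A}$    (88)

and using the property $\mathrm{Tr}(\Sigma X) = [\mathrm{vec}(X^\top)]^\top \mathrm{vec}(\Sigma)$, the
second term on the RHS of (83) can be written as:

$\mathbb{E}\{\|g(i)\|_{\Sigma}^{2}\} = [\mathrm{vec}(\mathcal{G}^\top)]^\top \sigma$.    (89)

Hence, the variance recursion (83) can be expressed as

$\mathbb{E}\{\|\tilde{w}(i+1)\|_{\sigma}^{2}\} = \mathbb{E}\{\|\tilde{w}(i)\|_{\mathcal{F}\sigma}^{2}\} + [\mathrm{vec}(\mathcal{G}^\top)]^\top \sigma + \varphi(r(i+1), \sigma, \tilde{w}(i))$.    (90)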
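The two matrix identities used to pass from (83) to (90) are easy to verify numerically. The sketch below checks $\mathrm{vec}(\mathcal{B}^\top \Sigma\, \mathcal{B}) = (\mathcal{B}^\top \otimes \mathcal{B}^\top)\,\mathrm{vec}(\Sigma)$ and $\mathrm{Tr}(\Sigma \mathcal{G}) = [\mathrm{vec}(\mathcal{G}^\top)]^\top \mathrm{vec}(\Sigma)$ on random matrices; the dimension `n` stands in for $MN$, and the random matrices are placeholders rather than quantities from the paper.

```python
import numpy as np


def vec(X):
    """Stack the columns of X on top of each other (column-major order)."""
    return np.asarray(X).flatten(order="F")


rng = np.random.default_rng(0)
n = 4                                    # stands in for MN
B = rng.standard_normal((n, n))          # placeholder for the matrix B
S = rng.standard_normal((n, n))
Sigma = S @ S.T                          # positive semi-definite weighting matrix
T = rng.standard_normal((n, n))
G = T @ T.T                              # placeholder for E{g g^T}

# vec(U Sigma W) = (W^T kron U) vec(Sigma) with U = B^T and W = B
F = np.kron(B.T, B.T)
sigma_prime_direct = vec(B.T @ Sigma @ B)
sigma_prime_kron = F @ vec(Sigma)

# Tr(Sigma X) = vec(X^T)^T vec(Sigma) with X = G
trace_direct = np.trace(Sigma @ G)
trace_vec = vec(G.T) @ vec(Sigma)
```

Note the column-major flattening (`order="F"`): NumPy flattens row by row by default, which would silently break the Kronecker identity for non-symmetric matrices.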

Theorem 2. (Mean-square-error stability) Assume that the data
model and Assumptions 1 and 2 hold. Then, for any initial
conditions, the multitask diffusion strategy (26) is mean-square
stable if the error recursion is mean stable and the matrix
$\mathcal{F}$ is stable. Using the approximation (87), the matrix $\mathcal{F}$ is
stable if the step-sizes satisfy (74).

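Proof: Since $\Sigma$ is a positive semi-definite matrix and the
vector $r(i+1)$ is uniformly bounded for all $i$, $\mathbb{E}\{\|r(i+1)\|_{\Sigma}^{2}\}$
can be bounded as

$0 \le \mathbb{E}\{\|r(i+1)\|_{\Sigma}^{2}\} \le \kappa_1$    (91)

for all $i$, where $\kappa_1$ is a positive constant. Since $r(i+1)$ is
uniformly bounded for all $i$, the vector $2 r^\top(i+1)\,\Sigma\,\mathcal{B}$ is
also bounded for all $i$. Let $\gamma_{\max}$ be a bound on the largest
component of $2 r^\top(i+1)\,\Sigma\,\mathcal{B}$ in absolute value for all $i$. We
obtain

$2\,|\mathbb{E}\{r^\top(i+1)\,\Sigma\,\mathcal{B}\,\tilde{w}(i)\}| \le \gamma_{\max} \sum_{\ell=1}^{MN} |\mathbb{E}\{[\tilde{w}(i)]_\ell\}| = \gamma_{\max} \cdot \|\mathbb{E}\{\tilde{w}(i)\}\|_1$.    (92)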

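Under condition (74) on the step-sizes, the mean error vector
$\mathbb{E}\{\tilde{w}(i)\}$ converges to a small bounded region as $i \to \infty$.
Hence, $\|\mathbb{E}\{\tilde{w}(i)\}\|_1$ can be upper bounded by some positive
constant scalar $\kappa_2$ for all $i$, and using the approximation (85),
$|\varphi(r(i+1), \sigma, \tilde{w}(i))|$ satisfies:

$|\varphi(r(i+1), \sigma, \tilde{w}(i))| \le \kappa_1 + \gamma_{\max}\kappa_2$    (93)

for all $i$. The positive constant $\kappa_3 \triangleq \kappa_1 + \gamma_{\max}\kappa_2$ can be written
as a scaled multiple of the positive quantity $[\mathrm{vec}(\mathcal{G}^\top)]^\top \sigma$
as $\kappa_3 = t\,[\mathrm{vec}(\mathcal{G}^\top)]^\top \sigma$ where $t > 0$ [42]. We arrive at the
following inequality for (90):

$\mathbb{E}\{\|\tilde{w}(i+1)\|_{\sigma}^{2}\} \le \mathbb{E}\{\|\tilde{w}(i)\|_{\mathcal{F}\sigma}^{2}\} + (1 + t) \cdot [\mathrm{vec}(\mathcal{G}^\top)]^\top \cdot \sigma$.    (94)

Iterating (94) starting from $i = 0$, we obtain

$\mathbb{E}\{\|\tilde{w}(i+1)\|_{\sigma}^{2}\} \le \mathbb{E}\{\|\tilde{w}(0)\|_{\mathcal{F}^{i+1}\sigma}^{2}\} + (1 + t)\,[\mathrm{vec}(\mathcal{G}^\top)]^\top \sum_{j=0}^{i} \mathcal{F}^{j}\sigma$,    (95)

where $\mathbb{E}\{\|\tilde{w}(0)\|_{\sigma}^{2}\}$ is the initial condition. If we show that
the RHS of (95) converges, then $\mathbb{E}\{\|\tilde{w}(i+1)\|_{\sigma}^{2}\}$ is stable.
The first term on the RHS of (95) vanishes as $i \to \infty$ if the
matrix $\mathcal{F}$ is stable. Consider now the second term on the RHS
of (95). The series $\sum_{j=0}^{\infty} \mathcal{F}^{j}\sigma$ converges if $\sum_{j=0}^{\infty} [\mathcal{F}^{j}\sigma]_k$
converges for $k = 1, \ldots, (MN)^2$. Each term of the series
can be bounded as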




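$[\mathcal{F}^{j}\sigma]_k \le |[\mathcal{F}^{j}\sigma]_k| \le \|\mathcal{F}^{j}\sigma\|_{b,\infty} \le \|\mathcal{F}^{j}\|_{b,\infty} \cdot \|\sigma\|_{b,\infty}$.    (96)

Fig. 3. Experimental setup. (b) Regression and noise variances versus node number $k$.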

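Since $\mathcal{F}$ is stable, there exists a submultiplicative norm* $\|\cdot\|_{\rho}$
such that $\|\mathcal{F}\|_{\rho} = \zeta < 1$. All norms are equivalent in finite-dimensional
vector spaces. Thus, we have:

$\|\mathcal{F}^{j}\|_{b,\infty} \le \tau\,\|\mathcal{F}^{j}\|_{\rho} \le \tau\,\|\mathcal{F}\|_{\rho}^{j} = \tau\,\zeta^{j}$    (97)

for some positive constant $\tau$. Considering this bound with (96)
yields:

$\sum_{j=0}^{\infty} |[\mathcal{F}^{j}\sigma]_k| \le \sum_{j=0}^{\infty} \|\mathcal{F}^{j}\|_{b,\infty} \cdot \|\sigma\|_{b,\infty} \le \sum_{j=0}^{\infty} \tau\,\zeta^{j}\,\|\sigma\|_{b,\infty} = \frac{\tau \cdot \|\sigma\|_{b,\infty}}{1 - \zeta}$.    (98)

As a consequence, since the second term on the RHS of (95)
converges to a bounded region when $\mathcal{F}$ is stable, $\mathbb{E}\{\|\tilde{w}(i+1)\|_{\sigma}^{2}\}$
also converges. ■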


IV. Simulation results 


Before proceeding, we present a new rule for selecting the
regularization weight $\rho_{k\ell}$ based on a measure of sparsity of
the vector $w_k^o - w_\ell^o$. The intuition behind this rule is to employ
a large weight $\rho_{k\ell}$ when the objectives at nodes $k$ and $\ell$ have
few distinct entries, i.e., $w_k^o - w_\ell^o$ is sparse, and a small weight
$\rho_{k\ell}$ when the objectives have few similar entries, i.e., $w_k^o - w_\ell^o$
is not sparse. Among other possible choices for the sparsity
measure, we select a popular one based on a relationship
between the $\ell_1$-norm and $\ell_2$-norm [56]:

$\xi(w_k^o - w_\ell^o) = \frac{M}{M - \sqrt{M}} \left( 1 - \frac{\|w_k^o - w_\ell^o\|_1}{\sqrt{M}\,\|w_k^o - w_\ell^o\|_2} \right) \in [0, 1]$.    (99)


The quantity $\xi(w_k^o - w_\ell^o)$ is equal to one when only a single
component of $w_k^o - w_\ell^o$ is non-zero, and zero when all elements
of $w_k^o - w_\ell^o$ are relatively large [56] (it is exactly zero when all
entries have the same magnitude). Since the nodes do
not know the true objectives $w_k^o$ and $w_\ell^o$, we propose to
replace these quantities by the available estimates at each time
instant $i$ and allow the regularization factors to vary with time
according to:

$\rho_{k\ell}(i) \propto \begin{cases} \frac{M}{M - \sqrt{M}} \left( 1 - \frac{\|\phi_k(i+1) - \phi_\ell(i+1)\|_1}{\sqrt{M}\,\|\phi_k(i+1) - \phi_\ell(i+1)\|_2} \right), & \text{if } \ell \in \mathcal{N}_k \setminus \mathcal{C}(k) \\ 0, & \text{otherwise} \end{cases}$    (100)

where the symbol $\propto$ denotes proportionality. As we shall
see in the simulations, this rule improves the performance of
the algorithm and allows agent $k$ to adapt the regularization
strength $\rho_{k\ell}$ with respect to the sparsity level of the vector
$w_k^o - w_\ell^o$ at time instant $i$.

* The norm $\|\cdot\|_{\rho}$ is called submultiplicative if for any square matrices $U$
and $W$ of compatible dimensions we have: $\|UW\|_{\rho} \le \|U\|_{\rho} \cdot \|W\|_{\rho}$.
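The sparsity measure (99) and the adaptive rule (100) translate directly into a few lines of code. The sketch below is a minimal illustration, not the authors' implementation: the function names, the `scale` argument (the proportionality coefficient), and the convention of returning 1 for an all-zero difference vector are assumptions. The measure requires $M \ge 2$.

```python
import numpy as np


def sparsity(x):
    """Sparsity measure of (99): 1 for a one-hot vector, 0 when all |entries| are equal."""
    x = np.asarray(x, dtype=float)
    M = x.size                      # requires M >= 2
    l2 = np.linalg.norm(x, 2)
    if l2 == 0.0:
        # Assumption of this sketch: identical estimates are treated as
        # maximally similar; (99) itself is undefined for the zero vector.
        return 1.0
    l1 = np.linalg.norm(x, 1)
    return (M / (M - np.sqrt(M))) * (1.0 - l1 / (np.sqrt(M) * l2))


def adaptive_rho(phi_k, phi_l, is_outside_neighbor, scale=1.0):
    """Rule (100): weight for l in N_k \\ C(k), zero otherwise."""
    if not is_outside_neighbor:
        return 0.0
    return scale * sparsity(np.asarray(phi_k, float) - np.asarray(phi_l, float))


one_hot = sparsity([0.0, 0.0, 3.0, 0.0])        # only one distinct entry
flat = sparsity([1.0, -1.0, 1.0, -1.0])         # every entry differs equally
rho = adaptive_rho([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0], True)
```

In this example the two intermediate estimates differ in a single entry, so the rule assigns the largest possible weight to the link.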


A. Illustrative example 

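We consider a clustered network with the topology shown
in Fig. 3(a), consisting of 20 nodes divided into 3 clusters:
$\mathcal{C}_1 = \{1, \ldots, 10\}$, $\mathcal{C}_2 = \{11, \ldots, 15\}$, and $\mathcal{C}_3 = \{16, \ldots, 20\}$.
The regression vectors $x_k(i)$ are $18 \times 1$ zero-mean Gaussian
distributed vectors with covariance matrices $R_{x,k} = \sigma_{x,k}^{2} I_{18}$.
The variances $\sigma_{x,k}^{2}$ are shown in Fig. 3(b). The noises $z_k(i)$
are zero-mean i.i.d. Gaussian random variables, independent
of any other signal, with variances $\sigma_{z,k}^{2}$ shown in Fig. 3(b). Let
$\mathrm{card}\{\cdot\}$ denote the cardinality of its argument. We run the diffusion
algorithm (26) by setting $c_{\ell k} = (\mathrm{card}\{\mathcal{N}_\ell \cap \mathcal{C}(\ell)\})^{-1}$ for $k \in \mathcal{N}_\ell \cap \mathcal{C}(\ell)$
and $a_{\ell k} = (\mathrm{card}\{\mathcal{N}_k \cap \mathcal{C}(k)\})^{-1}$ for $\ell \in \mathcal{N}_k \cap \mathcal{C}(k)$. The regularization
weights are set to $\rho_{k\ell} = (\mathrm{card}\{\mathcal{N}_k \setminus \mathcal{C}(k)\})^{-1}$ for $\ell \in \mathcal{N}_k \setminus \mathcal{C}(k)$. We
use a constant step-size $\mu = 0.02$ for all nodes, a sparsity
strength $\eta = 0.06$ for the $\ell_1$-norm regularizer and $\eta = 0.04$
for the reweighted $\ell_1$-norm regularizer with $\epsilon = 0.1$. The
results are averaged over 200 Monte-Carlo runs.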
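The uniform weighting rules described above can be generated from an adjacency matrix and a vector of cluster labels. The sketch below assumes an undirected graph in which every node belongs to its own neighborhood; the function name `uniform_weights` and the four-node example are illustrative and do not reproduce the topology of Fig. 3(a).

```python
import numpy as np


def uniform_weights(adjacency, labels):
    """Build A (a_{lk}), C (c_{lk}) and rho (rho_{kl}) with uniform weights.

    A[l, k] = 1 / card{N_k ∩ C(k)} for l in N_k ∩ C(k)   (columns sum to 1)
    C[l, k] = 1 / card{N_l ∩ C(l)} for k in N_l ∩ C(l)   (rows sum to 1)
    rho[k, l] = 1 / card{N_k \\ C(k)} for l in N_k \\ C(k)
    """
    labels = np.asarray(labels)
    n = labels.size
    # Assumption: undirected graph, each node is its own neighbor.
    adj = np.asarray(adjacency, dtype=bool) | np.eye(n, dtype=bool)
    same = labels[:, None] == labels[None, :]
    intra = (adj & same).astype(float)      # neighbors inside the cluster
    inter = (adj & ~same).astype(float)     # neighbors outside the cluster
    A = intra / intra.sum(axis=0, keepdims=True)
    C = intra / intra.sum(axis=1, keepdims=True)
    n_out = inter.sum(axis=1, keepdims=True)
    rho = np.divide(inter, n_out, out=np.zeros_like(inter), where=n_out > 0)
    return A, C, rho


# Path graph 0-1-2-3 with clusters {0, 1} and {2, 3}.
adjacency = np.zeros((4, 4), dtype=bool)
for u, v in [(0, 1), (1, 2), (2, 3)]:
    adjacency[u, v] = adjacency[v, u] = True
A, C, rho = uniform_weights(adjacency, labels=[0, 0, 1, 1])
```

Nodes without neighbors outside their cluster (nodes 0 and 3 here) get an all-zero row in `rho`, so the regularization term simply vanishes for them.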

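The optimum vectors are set to $w_{\mathcal{C}_j}^o = w^o + \delta_{\mathcal{C}_j}$
at each cluster with $w^o$ an $18 \times 1$ vector whose entries
are generated from the Gaussian distribution $\mathcal{N}(0, 1)$. First,
we set $\delta_{\mathcal{C}_1}$ to $0_{18 \times 1}$, $\delta_{\mathcal{C}_2}$ to $[-1 \;\; 0_{1 \times 17}]^\top$, and $\delta_{\mathcal{C}_3}$ to
$[0_{1 \times 6} \;\; -1 \;\; 0_{1 \times 11}]^\top$. Observe that at most two entries differ
between clusters. After 500 iterations, we set $\delta_{\mathcal{C}_2}$ to
$[-1_{1 \times 3} \;\; 1 \;\; 0_{1 \times 14}]^\top$ and $\delta_{\mathcal{C}_3}$ to $[0_{1 \times 12} \;\; -1_{1 \times 3} \;\; 0_{1 \times 3}]^\top$. In
this way, at most 7 entries differ between clusters. After 1000
iterations, we set $\delta_{\mathcal{C}_2}$ to $[-1_{1 \times 3} \;\; 1_{1 \times 3} \;\; -1_{1 \times 3} \;\; 0_{1 \times 9}]^\top$ and
$\delta_{\mathcal{C}_3}$ to $[0_{1 \times 9} \;\; 1_{1 \times 3} \;\; -1_{1 \times 3} \;\; 1_{1 \times 3}]^\top$. Thus, at most 18 entries
now differ between clusters.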
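The three perturbation patterns can be written out explicitly to confirm the stated counts of 2, 7, and 18 differing entries. In the sketch below, the helper `block`, the stage labels, and the choice $\delta_{\mathcal{C}_1} = 0$ in every stage are assumptions of this example; since $w^o$ is common to all clusters, only the perturbations matter for the count.

```python
import numpy as np


def block(*parts):
    """Concatenate (value, length) pairs into one vector."""
    return np.concatenate([np.full(n, v, dtype=float) for v, n in parts])


def n_distinct(deltas):
    """Number of entries on which the clusters do not all agree."""
    D = np.stack(deltas)                       # one row per cluster
    return int(np.sum(D.max(axis=0) != D.min(axis=0)))


zero = np.zeros(18)                            # delta for cluster 1 (assumed zero)
stages = {
    "0 <= i <= 500": (zero,
                      block((-1, 1), (0, 17)),
                      block((0, 6), (-1, 1), (0, 11))),
    "500 < i <= 1000": (zero,
                        block((-1, 3), (1, 1), (0, 14)),
                        block((0, 12), (-1, 3), (0, 3))),
    "1000 < i <= 1500": (zero,
                         block((-1, 3), (1, 3), (-1, 3), (0, 9)),
                         block((0, 9), (1, 3), (-1, 3), (1, 3))),
}
counts = [n_distinct(d) for d in stages.values()]

# Optimum vectors for the first stage, with a random common part w^o.
rng = np.random.default_rng(0)
w_common = rng.standard_normal(18)
w_opt = [w_common + d for d in stages["0 <= i <= 500"]]
```

The complementary counts, 16, 11, and 0 common entries out of 18, are the degrees of similarity used later when the regularization strength is varied.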

In Fig. 4 we compare 6 algorithms: the non-cooperative
LMS (algorithm (26) with $A = C = I_N$ and $\eta = 0$), the
regularized LMS (algorithm (26) with $A = C = I_N$) with
$\ell_1$-norm and reweighted $\ell_1$-norm, the multitask diffusion LMS
without regularization (algorithm (26) with $\eta = 0$), and the
multitask diffusion LMS (26) with $\ell_1$-norm and reweighted
$\ell_1$-norm regularization. As observed in this figure, when the tasks
share a sufficient number of components, cooperation between
clusters enhances the network MSD performance. When the
number of common entries decreases, the cooperation between
clusters becomes less effective. The use of the $\ell_1$-norm can
lead to a degradation of the MSD relative to the absence of
cooperation among clusters. However, the use of the reweighted
$\ell_1$-norm improves the performance.

Fig. 4. Network MSD comparison for 6 different strategies: non-cooperative
LMS (algorithm (26) with $A = C = I_N$, $\eta = 0$), spatially regularized
LMS (algorithm (26) with $A = C = I_N$) with $\ell_1$-norm and reweighted
$\ell_1$-norm, standard diffusion without cooperation between clusters (algorithm (26)
with $\eta = 0$), and our proximal diffusion (26) with $\ell_1$-norm and reweighted
$\ell_1$-norm.

In order to better understand the behavior of the algorithm (26)
in the clusters, we report in Fig. 5 the learning
curves for $i \in [0, 1000]$ of the common and distinct entries
among clusters given by

$\frac{1}{\mathrm{card}\{\mathcal{C}_j\}} \sum_{k \in \mathcal{C}_j} \sum_{m \in \mathcal{H}(i)} \mathbb{E}\{([w_k^o(i)]_m - [w_k(i)]_m)^2\}$    (101)

for $j = 1, 3$, where $\mathcal{H}(i)$ is the set of identical (or distinct)
components among all clusters at iteration $i$ and $w_k^o(i)$ is
the optimum parameter vector at node $k$ and iteration $i$. For
example, for $i \in [0, 500]$, the set of distinct components is
$\{1, 7\}$. As shown in this figure, cluster $\mathcal{C}_3$ benefits considerably
from cooperation with other clusters in the estimation of the
common entries, whereas cluster $\mathcal{C}_1$ benefits only slightly.
This is because the performance of $\mathcal{C}_3$ is poor relative to
that of $\mathcal{C}_1$: the SNR in $\mathcal{C}_3$ is small
and this cluster contains only 5 nodes.

We shall now illustrate the effect of the regularization
strength $\eta$ on the performance of the algorithm for different
numbers of common entries between the optimum vectors $w_k^o$.
We consider the same settings as above, which means that the
number of common entries among clusters is successively set
to 16, 11, and 0 out of 18. Parameter $\eta$ is uniformly sampled
over $[0, 0.14]$. Figure 6 shows the gain in steady-state MSD
versus the unregularized algorithm obtained for $\eta = 0$, as a
function of $\eta$. For each $\eta$, the results are averaged over 50
Monte-Carlo runs and over 50 samples after convergence of
the algorithm. It can be observed in Fig. 6 that the interval for
$\eta$ over which the network benefits from cooperation between
clusters becomes smaller as the number of common entries
decreases. In addition, the reweighted $\ell_1$-norm regularizer
provides better performance than the $\ell_1$-norm regularizer.

Fig. 7. Network MSD comparison for the same 6 different strategies
considered in Fig. 4 using adaptive regularization factors.

In order to guarantee a correct cooperation among clusters,
we repeat the same experiment as in Fig. 4 using the adaptive rule
in (100) for adjusting the regularization factors $\rho_{k\ell}$. The
proportionality coefficient in (100) is set equal to one. As shown
in Fig. 7, when the number of distinct components is small,
both $\ell_1$ and reweighted $\ell_1$-norms yield better performance than
the diffusion LMS with $\eta = 0$. When the number of distinct
components increases ($i \in (1000, 1500]$), the performance of
strategy (26) with $\ell_1$-norm gets closer to diffusion LMS with
$\eta = 0$, while the reweighted $\ell_1$-norm still guarantees a gain.
Thus, the mechanism proposed in (100) for the selection of
the regularization factors improves the cooperation between
nodes belonging to distinct clusters.

Finally, we compare the current multitask diffusion strategy (26)
with two other useful strategies existing in the literature
[24], [29]. We consider a stationary environment where
the optimum parameter vectors $\{w_{\mathcal{C}_j}^o\}_{j=1}^{3}$ consist of a sub-vector
$\xi^o$ of 16 parameters of global interest to the whole network
and a $2 \times 1$ sub-vector $\varsigma_{\mathcal{C}_j}^o$ of common interest to nodes
belonging to cluster $\mathcal{C}_j$, namely, $w_{\mathcal{C}_j}^o = \mathrm{col}\{\xi^o, \varsigma_{\mathcal{C}_j}^o\}$. The
entries of $\xi^o$, $\varsigma_{\mathcal{C}_1}^o$, $\varsigma_{\mathcal{C}_2}^o$, and $\varsigma_{\mathcal{C}_3}^o$ are sampled from a
uniform distribution $\mathcal{U}(-3, 3)$. Except for these changes, we
consider the same experimental setup described in the first
paragraph of the current section. When applying the strategy
developed in [24], we assume that node $k$ belonging to cluster
$\mathcal{C}_j$ is aware that the first 16 parameters of $w_{\mathcal{C}_j}^o$ are of global
interest to the whole network while the remaining parameters
are of common interest to nodes in cluster $\mathcal{C}_j$. However, the
current method (26) and the algorithm in [29] do not require
such an assumption. We run the ATC D-NSPE strategy developed
in [24] using uniform combination weights equal to $1/\mathrm{card}\{\mathcal{N}_k\}$
for $\ell \in \mathcal{N}_k$ and $1/\mathrm{card}\{\mathcal{N}_k \cap \mathcal{C}(k)\}$ for $\ell \in \mathcal{N}_k \cap \mathcal{C}(k)$,
and uniform step-sizes $\mu_k = 0.02 \;\forall k$. We run the multitask
diffusion strategy developed in [29] by setting $\{c_{\ell k}, a_{\ell k}, \rho_{k\ell}\}$
in the same manner described in the first paragraph of the
current section, $\mu_k = 0.02 \;\forall k$, and $\eta = 0.06$. The learning
curves of the algorithms are reported in Fig. 8. As expected, it
can be observed that the cooperation between clusters based on
the $\ell_2$-norm degrades the performance relative to the case
of non-cooperative clusters, i.e., $\eta = 0$. Indeed, the multitask
diffusion strategy developed in [29] considers squared
$\ell_2$-norm co-regularizers to promote the smoothness of the graph
signal, whereas in the current simulation we need to promote
the sparsity of the vector $w_k^o - w_\ell^o$. Furthermore, when the
reweighted $\ell_1$-norm is used, our method is able to perform
well compared to the strategy developed in [24], which requires
knowledge of the indices of common and distinct entries
in the parameter vectors. We note that recent unsupervised
strategies [57], [58] dealing with groups of variables rather
than variables propose to add a step in order to adapt the
cooperation between neighboring nodes based on the group at
hand. It is shown in [57] that the performance depends heavily
on the group decomposition of the parameter vectors.

Fig. 5. Clusters MSD over identical and distinct components. Comparison for the same 6 different strategies considered in Fig. 4.
(a) Cluster 1 MSD over identical entries. (b) Cluster 3 MSD over identical entries.
(c) Cluster 1 MSD over distinct entries. (d) Cluster 3 MSD over distinct entries.

Fig. 6. Differential network MSD ($\mathrm{MSD}(\eta) - \mathrm{MSD}(\eta = 0)$) in dB with respect to the regularization strength $\eta$ for the multitask diffusion LMS (26) with
$\ell_1$-norm (left) and reweighted $\ell_1$-norm (right) for 3 different degrees of similarity between tasks. Experiment 1: at most 2 entries differ between clusters.
Experiment 2: at most 7 entries differ between clusters. Experiment 3: at most 18 entries differ between clusters.

B. Distributed spectrum sensing 

Consider a cognitive radio network composed of $N_P$ primary
users (PU) and $N_S$ secondary users (SU). To avoid causing
harmful interference to the primary users, each secondary
user has to detect the frequency bands used by all primary
users, even under low signal-to-noise ratio conditions [24],
[59]. We assume that the secondary users are grouped into $Q$
clusters and that there exists within each cluster a low-power
interference source (IS). The goal of each secondary user is
to estimate the aggregated spectrum transmitted by all active
primary users, as well as the spectrum of the interference
source present in its cluster.

Fig. 8. Network MSD comparison for 5 different strategies: standard
diffusion without cooperation between clusters (algorithm (26) with $\eta = 0$),
our proximal diffusion (26) with $\ell_1$-norm and reweighted $\ell_1$-norm, the ATC
D-NSPE algorithm developed in [24], and the multitask diffusion strategy
with squared $\ell_2$-norm co-regularizers [29].

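In order to facilitate the estimation task of the secondary
users, we assume that the power spectrum of the signal
transmitted by the primary user $p$ and the interference source
$q$ can be represented by a linear combination of $N_B$ basis
functions $\phi_m(f)$:

$S_p(f) = \sum_{m=1}^{N_B} \alpha_{pm}\,\phi_m(f), \qquad p = 1, \ldots, N_P,$    (102)

$S_q(f) = \sum_{m=1}^{N_B} \beta_{qm}\,\phi_m(f), \qquad q = 1, \ldots, Q,$    (103)

where $\alpha_{pm}$, $\beta_{qm}$ are the combination weights, and $f$ is the
normalized frequency. Each secondary user $k \in \mathcal{C}_q$ has to estimate
the $N_B(N_P + 1) \times 1$ vector $w_k^o = \mathrm{col}\{\alpha_1^o, \ldots, \alpha_{N_P}^o, \beta_q^o\}$
where $\alpha_p^o = [\alpha_{p1}, \ldots, \alpha_{pN_B}]^\top$ and $\beta_q^o = [\beta_{q1}, \ldots, \beta_{qN_B}]^\top$.
Let $\ell_{p,k}(i)$ denote the path loss factor between the primary
user $p$ and the secondary user $k$ at time $i$. Let also $\ell_{q,k}(i)$
denote the path loss factor between the interference source $q$
and the secondary user $k$ at time $i$. Then, the power spectrum
sensed by node $k \in \mathcal{C}_q$ at time $i$ and frequency $f_j$ can be
expressed as follows:

$r_{k,j}(i) = \sum_{p=1}^{N_P} \ell_{p,k}(i)\,S_p(f_j) + \ell_{q,k}(i)\,S_q(f_j) + z_{k,j}(i)$    (104)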

where $z_{k,j}(i)$ is the sampling noise at the $j$-th frequency,
assumed to be zero-mean Gaussian with variance $\sigma_{z,k}^{2}$. At
each time instant $i$, node $k$ observes the power spectrum over
$N_F$ frequency samples. Let $r_k(i)$ and $z_k(i)$ be the $N_F \times 1$
vectors whose $j$-th entries are $r_{k,j}(i)$ and $z_{k,j}(i)$, respectively.
Using (104), we can establish the following linear data model
for node $k \in \mathcal{C}_q$:

$r_k(i) = X_k(i)\,w_k^o + z_k(i)$,    (105)

where $X_k(i) \triangleq [\ell_{1,k}(i), \ldots, \ell_{N_P,k}(i), \ell_{q,k}(i)] \otimes \Phi$ with $\Phi$ the
$N_F \times N_B$ matrix whose $j$-th row contains the magnitudes of
the $N_B$ basis functions at the frequency sample $f_j$.

Fig. 9. A cognitive radio network consisting of 2 primary users and 13
secondary users grouped into 4 clusters, each containing an interference source
(IS).

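To show the effect of multitask learning with sparsity-based
regularization, we consider a cognitive radio network
consisting of $N_P = 2$ primary users and $N_S = 13$ secondary
users decomposed into 4 clusters as shown in Fig. 9. The
power spectrum is represented by a combination of $N_B = 20$
Gaussian basis functions centered at the normalized frequency
$f_m$ with variance $\sigma_m^{2} = 0.001$ for all $m$:

$\phi_m(f) = \exp\left( -\frac{(f - f_m)^2}{2\sigma_m^{2}} \right)$,    (106)

where the central frequencies $f_m$ are uniformly distributed.
The combination vectors are set to:

$w_{\mathcal{C}_1}^o = [0_{1 \times 4} \;\; 1 \;\; 1 \;\; 0_{1 \times 14}, \; 0_{1 \times 14} \;\; 1 \;\; 1 \;\; 0_{1 \times 4}, \; 0 \;\; 0.3 \;\; 0.3 \;\; 0_{1 \times 17}]^\top$
$w_{\mathcal{C}_2}^o = [0_{1 \times 4} \;\; 1 \;\; 1 \;\; 0_{1 \times 14}, \; 0_{1 \times 14} \;\; 1 \;\; 1 \;\; 0_{1 \times 4}, \; 0_{1 \times 20}]^\top$
$w_{\mathcal{C}_3}^o = [0_{1 \times 4} \;\; 1 \;\; 1 \;\; 0_{1 \times 14}, \; 0_{1 \times 14} \;\; 1 \;\; 1 \;\; 0_{1 \times 4}, \; 0 \;\; 0.3 \;\; 0_{1 \times 16} \;\; 0.3 \;\; 0]^\top$
$w_{\mathcal{C}_4}^o = [0_{1 \times 4} \;\; 1 \;\; 1 \;\; 0_{1 \times 14}, \; 0_{1 \times 14} \;\; 1 \;\; 1 \;\; 0_{1 \times 4}, \; 0_{1 \times 17} \;\; 0.3 \;\; 0.3 \;\; 0]^\top$.
    (107)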
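The data model (105) with the combination vectors (107) can be assembled as follows. This is a sketch under stated assumptions: the Gaussian form of the basis functions, the frequency grid and centers spread uniformly over $[0, 1]$, the path loss values, and the names `gaussian_basis` and `regression_matrix` are all illustrative choices rather than the exact settings of the experiment.

```python
import numpy as np


def gaussian_basis(freqs, centers, var):
    """N_F x N_B matrix whose j-th row holds the basis functions evaluated at freqs[j]."""
    f = np.asarray(freqs, dtype=float)[:, None]
    fm = np.asarray(centers, dtype=float)[None, :]
    return np.exp(-(f - fm) ** 2 / (2.0 * var))


def regression_matrix(path_losses, Phi):
    """[l_1, ..., l_{N_P}, l_q] kron Phi, of size N_F x N_B (N_P + 1)."""
    return np.kron(np.asarray(path_losses, dtype=float)[None, :], Phi)


N_F, N_B = 80, 20
freqs = np.linspace(0.0, 1.0, N_F)       # assumed normalized frequency grid
centers = np.linspace(0.0, 1.0, N_B)     # assumed uniformly spaced centers
Phi = gaussian_basis(freqs, centers, var=0.001)

# Combination vector of cluster 1 in (107): two primary users and one interferer.
alpha_1 = np.concatenate((np.zeros(4), [1.0, 1.0], np.zeros(14)))
alpha_2 = np.concatenate((np.zeros(14), [1.0, 1.0], np.zeros(4)))
beta_1 = np.concatenate(([0.0, 0.3, 0.3], np.zeros(17)))
w_C1 = np.concatenate((alpha_1, alpha_2, beta_1))

path_losses = [0.5, 0.2, 0.1]            # illustrative values for l_{1,k}, l_{2,k}, l_{q,k}
X = regression_matrix(path_losses, Phi)
noiseless_spectrum = X @ w_C1            # r_k(i) without the sampling noise
```

The Kronecker structure means the sensed spectrum is the path-loss-weighted sum of the individual spectra, which is why a secondary user far from a primary user (small path loss factor) observes that user's contribution only weakly.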


We consider $N_F = 80$ frequency samples. Based on the
free propagation theory, we set the deterministic path loss
factor $\ell_{p,k}$ to the inverse of the squared distance between
the transmitter $p$ and the receiver $k$. At time instant $i$, we
set $\ell_{p,k}(i) = \ell_{p,k} + \delta\ell_{p,k}(i)$ with $\delta\ell_{p,k}(i)$ a zero-mean
Gaussian random variable with standard deviation $0.1\,\ell_{p,k}$. The
secondary user $k$ estimates $\ell_{p,k}(i)$ according to the following
model:

$\hat{\ell}_{p,k}(i) = \begin{cases} \ell_{p,k}(i), & \text{if } \ell_{p,k}(i) > \ell_0 \\ 0, & \text{otherwise} \end{cases}$    (108)


with $\ell_0$ a threshold value. The same rule is used to set the path
loss factor between the interference sources and the secondary
users. We run the ATC diffusion algorithm (26) with the
following adaptation step:

$\psi_k(i+1) = w_k(i) + \mu_k\,\hat{X}_k^\top(i)\,[r_k(i) - \hat{X}_k(i)\,w_k(i)]$    (109)

with $\hat{X}_k(i)$ the estimate of $X_k(i)$ at time instant $i$. The sampling
noise $z_{k,j}(i)$ is assumed to be a zero-mean Gaussian random
variable with standard deviation 0.01. The combination coefficients
$\{a_{\ell k}\}$ and regularization factors $\{\rho_{k\ell}\}$ are set in the
same way as in the previous experiment.

The MSD learning curves are averaged over 50 Monte-Carlo
runs. We run the multitask diffusion LMS (26) in two
different situations. In the first scenario, we do not allow any
cooperation between clusters by setting $\eta = 0$. In the second
scenario, we set the regularization strength $\eta$ to 0.01 and we
use the $\ell_1$-norm as co-regularizing function. As can be seen
in Fig. 10, the network MSD performance is significantly
improved by cooperation among clusters. For comparison
purposes, we also run the ATC D-NSPE strategy developed
in [24] and the multitask diffusion strategy with $\ell_2$-norm
developed in [29]. For the ATC D-NSPE strategy we assume
that nodes are aware that the first $N_P \times N_B$ components of
the vector $w_k^o$ are of global interest to the whole network and
that the remaining components are of common interest to the
cluster $\mathcal{C}(k)$. The link weights, including $\{a_{\ell k}, c_{\ell k}, \rho_{k\ell}\}$, are
set in the same manner as in the experiment of Fig. 8. It can be
observed from Fig. 10 that our strategy performs well without
the need to know the parameters of global interest and the
parameters of common interest during the learning process.
Figure 11 shows the estimated power spectrum density for
nodes 2, 4, 7, and 13 when running the multitask diffusion
strategy (26) with $\eta = 0$ (left) and $\eta = 0.01$ (right). In the
left plot, we observe that the clusters are able to estimate their
interference source. However, depending on the distance to
the primary users, the secondary users do not always succeed
in estimating the power spectrum transmitted by all active
primary users. For example, clusters 1 and 2 are not able to
estimate the power spectrum transmitted by PU2. As shown
in the right plot, regardless of the distance between primary
and secondary users, each secondary user is able to estimate
the aggregated power spectrum transmitted by all the primary
users and its own interference source by cooperating with
nodes belonging to neighboring clusters.


V. Conclusion and perspectives 

In this work, we considered multitask learning problems
over networks where the optimum parameter vectors to be
estimated by neighboring clusters have a large number of
similar entries and a relatively small number of distinct entries.
It then becomes advantageous to develop distributed strategies
that involve cooperation among adjacent clusters in order
to exploit these similarities. A diffusion forward-backward
splitting algorithm with $\ell_1$-norm and reweighted $\ell_1$-norm
co-regularizers was derived to address this problem. A closed-form
expression for the proximal operator was derived to
achieve higher efficiency. Conditions on the step-sizes to
ensure convergence of the algorithm in the mean and mean-square
sense were derived. Finally, simulation results were
presented to illustrate the benefit of cooperating to promote
similarities between estimates. Future research efforts will be
focused on exploiting other sparsity-promoting co-regularizers.
Perspectives also include the derivation of other forms of
cooperation depending on prior information.

Fig. 10. Network MSD comparison for 4 different algorithms: standard
diffusion LMS without cooperation between clusters ($\eta = 0$), our proximal
diffusion (26) with $\ell_1$-norm regularizer, the ATC D-NSPE algorithm developed
in [24], and the multitask diffusion strategy [29].